Introduction


1.1.  Introduction 

Mathematical  modeling  is  a  popular  tool  used  to  predict  or  control  an  uncertain  future.  The  complexities 
of  a  real-world  system  are  abstracted  into  variables  linked  by  mathematical  relationships  that  are  hoped  to 
capture  the  essential  behavior  of  the  system. 

Some models are merely descriptive; that is, they attempt only to predict the course of events based
on a set of assumptions. But many models incorporate control variables that can be adjusted to try to affect
the future. Such models are prescriptive rather than descriptive.

A paradigm from economics that is often used in prescriptive models is the assumption that the controls
are adjusted so as to maximize a utility function (or minimize a cost, which is effectively a negative utility).
This point of view makes many prescriptive models into constrained optimization models, which are of
the form

$$
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{subject to} \quad & c(x) = 0
\end{aligned}
\tag{1.1.1}
$$

where x represents the control variables, f(x) is a “cost” that is to be minimized, and c(x) = 0 represents
the constraints imposed on the variables of the model by the structure of the system under study.

Since models of the general form (1.1.1) have become widely used, methods for solving them have been
intensively studied. A good general reference for practical modern methods of solution is Gill, Murray and
Wright (1981).

As the capacity and speed of computers have dramatically risen, our ability to solve larger and larger
models has also increased. But not all advances in methods are attributable to the existence of faster
machines. Since the time that the first optimization models were introduced, particularly after the advent of
linear programming in 1947 (see Dantzig (1963)), it has been recognized that real-world problems are usually
specially structured. Clever algorithms can take advantage of special structures and speed up the solution of
optimization models. The increases in size and solution speed greatly exceed the improvements in computer
hardware.

One of the earliest and most important kinds of special structure that was recognized was sparsity.
Roughly speaking, sparsity means that a variable interacts directly with only a few other variables. Thus in
(1.1.1), sparsity would mean that most of the entries of the Jacobian of c are zero for all x. Sparsity tends
to appear naturally because most variables in an optimization model interact only with other variables that
are fairly “close” in either space or time. A classic example is a transportation network, where nodes of the
network interact only with their immediate neighbors.

Sparsity is exploited in two ways by solution methods. First, since most of the potential information
is zero, the number of pieces of data that must be stored in the computer is drastically reduced. Second,
operations on sparse data can be done faster by exploiting sparsity. For example, in multiplying two sparse
matrices, products involving a zero need not be computed.
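Both savings can be seen in a minimal sketch. The storage layout (a dictionary of rows, each holding only its non-zero entries) and the function name `sparse_matvec` are assumptions made for illustration, and a matrix-vector product stands in for the matrix-matrix product mentioned above.

```python
def sparse_matvec(rows, x):
    """Multiply a square sparse matrix by a dense vector.

    `rows` maps a row index to a dict {column index: non-zero value}.
    Zero entries are simply absent, so they cost neither storage nor
    multiplications.
    """
    y = [0.0] * len(x)
    for i, row in rows.items():
        y[i] = sum(value * x[j] for j, value in row.items())
    return y


# A 4 x 4 tridiagonal matrix: 10 stored entries instead of 16.
A = {
    0: {0: 2.0, 1: -1.0},
    1: {0: -1.0, 1: 2.0, 2: -1.0},
    2: {1: -1.0, 2: 2.0, 3: -1.0},
    3: {2: -1.0, 3: 2.0},
}
y = sparse_matvec(A, [1.0, 1.0, 1.0, 1.0])
stored_entries = sum(len(row) for row in A.values())
```

Only the ten stored entries take part in the product; the six zeros are never touched.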

The importance of sparse methods is evidenced in many ways. Every commercial linear programming
code uses sparse matrix methods in the representation of the problem and for handling its bases. The most
recent Sparse Matrix Symposium attracted 119 researchers who listened to 60 presented papers. A keyword
search of the on-line Math Reviews on MATHFILE turned up 413 reviewed papers on sparsity since 1972.

This  thesis  continues  the  trend  of  research  into  better  ways  of  exploiting  sparsity  by  considering  two 
different  problems  in  which  sparsity  is  crucial. 

The first problem, discussed in Chapter 2, is to compute a finite-difference approximation of a sparse
Hessian matrix with a minimum number of gradient evaluations. For many functions, sparsity allows a
clever method to approximate the Hessian with surprisingly few gradient evaluations. Some of the results
in Chapter 2 appeared in McCormick (1981).

The second problem, discussed in Chapter 3, is that of making a given sparse matrix even sparser,
perhaps as sparse as possible. The motivation is that increased sparsity should lead to savings in time and
storage. An earlier version of some of the material in Chapter 3 appeared in Hoffman and McCormick (1982).

This  technical  report  is  a  reformatted  version  of  McCormick  (1983). 

1.2.  Background  and  Notation 

Though the problems studied in this thesis appear to be numerical by nature, they are both amenable
to a combinatorial approach. Much of the preliminary material in Chapters 2 and 3 is aimed at showing
the underlying combinatorial nature of the problems. The presentation assumes a basic knowledge of
combinatorics, though references are given as new concepts are encountered. A basic reference for graph
theory is Bondy and Murty (1976). Two good references for combinatorial optimization, and for bipartite
matching in particular, are Lawler (1976) and Papadimitriou and Steiglitz (1982).

An important combinatorial concept that appears constantly in both chapters is the notion of
computational complexity, which attempts to determine whether some problems are inherently more difficult
to solve than others. The most important tool available for this purpose is NP-Completeness. Roughly
speaking, a problem X is NP-Complete if it is as hard to solve as any of the well-known hard combinatorial
problems such as the Traveling Salesman Problem or the Graph Coloring Problem. A good reference for
NP-Completeness is Garey and Johnson (1979).

A problem X is shown to be NP-Complete by reducing a known NP-Complete problem Y to X.
Reducing Y to X means demonstrating a polynomial-time way of encoding an instance of Y into an instance
of X, with the property that solving the instance of X also solves the encoded instance of Y.
Thus a fast (polynomial time) algorithm for solving X would also solve Y in polynomial time, so that X
can be no easier than Y.

Strictly speaking, it is not known whether NP-Complete problems are actually harder than problems
that have known polynomial algorithms, but the conventional wisdom among complexity theorists is that any
algorithm that solves such a problem must take an exponential number of steps, so that NP-Completeness is
tantamount to practical intractability. This belief should not be taken to imply that there is no hope of ever
making any progress on an NP-Complete problem. One of the most active areas of complexity research is in
finding and analyzing heuristic algorithms for NP-Complete problems (algorithms that only approximately
solve a problem, or efficiently solve a subset of instances).

On occasion we shall want to distinguish between typical problems encountered in practice and
arbitrarily structured problems. It is a truism of sparsity research that many practical problems have ill-defined
additional structure that tends to make them more tractable than, say, randomly generated problems. When
we want to refer to this phenomenon, we shall write of “real” or “real-life” or “practical” problems.

Despite attempting to fully “combinatorialize” our problems, at some points their numerical properties
have to be considered. The reader should be aware that finite-precision arithmetic (a model of computer
arithmetic) differs in many respects from “exact” arithmetic. The issues involved in trying to maintain
accuracy in numerical computations are encompassed by the terms numerical stability and conditioning.
An introduction to this subject is given in Dahlquist and Björck (1974).

Our notational conventions are as follows. A term is printed in bold-face when it is being defined, and
slanted type is used for emphasis. Capital letters A, B, ..., are used for matrices and index sets, small letters
a, b, ..., for vectors and scalars, and script letters for graphs and matchings. Theorems, equations
and tables are all referred to by a three part number of the form x.y.z, which means the zth occurrence of
that object in major section x.y. The end of a proof is marked by the symbol “□”.
2.1.  Introduction  to  Sparse  Hessians

In numerical optimization procedures it is sometimes necessary to evaluate the Hessian matrix

$$H(x^0) = \left( \frac{\partial^2 F}{\partial x_i \, \partial x_j}(x^0) \right)$$

of a function $F: R^n \to R$. It is usually preferable to evaluate $H(x^0)$ analytically, but it is not always possible
to do so. For instance, F may not be known analytically (if, say, F is the output of a simulation), F may
be of a form that makes H very complicated to evaluate analytically, or the user of an optimization routine
may simply be unwilling to provide an evaluation routine for H. These considerations make it useful for the
designer of a “black box” optimization routine to include a facility for approximating H by finite-differencing.

We shall assume that there is a way to evaluate the gradient of F, call it
$g(x) = (\partial F/\partial x_1, \ldots, \partial F/\partial x_n)$, for
use in finite-differencing. The fundamental fact that is used in finite-differencing is that if d is a “suitably”
small perturbation vector (see Gill, Murray and Wright (1981), Sections 4.6.1 and 8.6, for a discussion of
difference interval sizes), then differencing g along direction d gives the linear equations

$$d^T \hat{H}(x^0) = g(x^0 + d) - g(x^0), \tag{2.1.1}$$

where $\hat{H}(x^0)$ is an approximation to $H(x^0)$. Note that the right hand side of (2.1.1) is calculated from the
gradient evaluation routine, the “unknowns” are the entries of $\hat{H}$, and (2.1.1) is a system of n equations, one
for each component of the gradient.

The most common and straightforward method of approximating $H(x^0)$ is successively to choose d in
(2.1.1) to be a small multiple of each of the unit vectors $e^1, e^2, \ldots, e^n$. Let $\delta_i$ be the chosen difference interval
for the ith coordinate. When $d = \delta_i e^i$ the jth equation in (2.1.1) is

$$\delta_i \hat{h}_{ij}(x^0) = g_j(x^0 + \delta_i e^i) - g_j(x^0),$$

thus allowing us to solve for an approximation to all of row i of $H(x^0)$. For any smooth F, H is symmetric,
and so the $\hat{H}$ resulting from the usual method is symmetrized by setting

$$\hat{H} \leftarrow \tfrac{1}{2} (\hat{H} + \hat{H}^T).$$
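The usual unit-vector method can be sketched as follows. The test function $F(x, y, z) = x^3 + (y + z)^3$, the single difference interval `delta` shared by all coordinates, and the names `gradient` and `fd_hessian` are assumptions chosen for illustration; this is a sketch of the idea, not a production routine.

```python
def gradient(x):
    """Gradient of the illustrative function F(x, y, z) = x**3 + (y + z)**3."""
    s = x[1] + x[2]
    return [3.0 * x[0] ** 2, 3.0 * s ** 2, 3.0 * s ** 2]


def fd_hessian(grad, x0, delta=1e-6):
    """Approximate the Hessian at x0 using n extra gradient evaluations."""
    n = len(x0)
    g0 = grad(x0)                       # normally already known to the optimizer
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        x = list(x0)
        x[i] += delta                   # difference direction d = delta * e^i
        gi = grad(x)
        for j in range(n):
            # jth equation of (2.1.1): delta * h_ij = g_j(x0 + d) - g_j(x0)
            H[i][j] = (gi[j] - g0[j]) / delta
    # Symmetrize: H <- (H + H^T) / 2
    return [[0.5 * (H[i][j] + H[j][i]) for j in range(n)] for i in range(n)]


H = fd_hessian(gradient, [1.0, 2.0, 3.0])
```

At the point (1, 2, 3) the exact Hessian has the entry 6 in position (1,1) and 30 in the lower 2 x 2 block, and the approximation agrees to within the truncation error of the forward difference.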

Though equations (2.1.1) are trivial to solve when unit vectors are used for differencing, the procedure has
one great drawback. When considering the running times of optimization routines, the standard assumption
is that calculating g(x) is expensive relative to other operations. The value $g(x^0)$ must be calculated by an
optimization routine for other purposes, so we do not include it in evaluation counts. In addition to $g(x^0)$,
the usual procedure requires n gradient evaluations for each approximation of H, and so can be prohibitively
expensive even for moderately large n. As is shown later in Section 2.5, when the Hessian has no special
structure the number of gradient evaluations cannot be reduced below n, making explicit approximation of
H through finite-differencing unattractive. In some contexts, adequate approximations to H can be obtained
efficiently through other means; see, e.g., the vast literature on quasi-Newton methods (Gill, Murray and
Wright (1981), Section 4.5.2). However, even with such methods, an explicit Hessian approximation can be
useful for distinguishing between a saddle point and a true minimum.
The saving grace is that H often has special structure that can be exploited. Sometimes it is known
from the structure of the problem that $h_{ij}(x) = 0$ for some i and j, independent of x. A Hessian is said to be
sparse when such information is known about a large proportion of its entries. It is convenient to represent
the sparsity information by a matrix of X’s and 0’s, where “0” represents an entry known to be zero for all
x of interest, and “X” represents any other entry. Such a matrix is called the sparsity pattern of H, and
inherits its symmetry. For example, if $F(x, y, z) = x^3 + (y + z)^3$, the sparsity pattern of its Hessian is

$$\begin{pmatrix} X & 0 & 0 \\ 0 & X & X \\ 0 & X & X \end{pmatrix}.$$

As a slight generalization, note that it may also be known that an $h_{ij}(x)$ is a non-zero constant
independent of x. Such an entry can be treated almost like a zero in this context, the only difference
being that the constant must be subtracted from the right hand side of (2.1.1) at the appropriate point. For
simplicity, we shall subsequently consider only the zero/non-zero distinction, though only minor changes are
needed to adapt the results to the constant case as well.

As an example of how sparsity can be used to approximate a Hessian more efficiently, consider the
“arrowhead” sparsity pattern (see Powell and Toint (1979), p. 1060):

$$\begin{pmatrix}
X & 0 & 0 & \cdots & X \\
0 & X & 0 & \cdots & X \\
0 & 0 & X & \cdots & X \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
X & X & X & \cdots & X
\end{pmatrix} \tag{2.1.2}$$

Choose the first difference direction to be $d^1 = \delta_n e^n$. The resulting jth equation in (2.1.1) is

$$\delta_n \hat{h}_{jn}(x^0) = g_j(x^0 + d^1) - g_j(x^0).$$

These equations can be solved for $\hat{h}_{jn}(x^0)$, j = 1, 2, ..., n. Choose the second difference direction to be
$d^2 = \sum_{i=1}^{n-1} \delta_i e^i$. The jth equation of (2.1.1) is now

$$\delta_j \hat{h}_{jj}(x^0) = g_j(x^0 + d^2) - g_j(x^0), \qquad j = 1, 2, \ldots, n - 1,$$

yielding the remaining non-zeros in $\hat{H}$. Thus the n evaluations necessary in the usual method have been
reduced to just two by using special difference directions. With the assumption that gradient evaluation
is expensive, this is a significant saving and makes finite-difference approximations feasible for large-scale
optimization.
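The two-evaluation scheme for the arrowhead pattern can be sketched as follows. The test function (chosen only because its Hessian has the arrowhead pattern), the common difference interval `delta`, and the names `make_counting_gradient` and `arrowhead_hessian` are hypothetical, introduced here for illustration.

```python
def make_counting_gradient():
    """Gradient of the hypothetical function
        F(x) = sum_{i<n} (x_i**3 + x_n * x_i**2) + x_n**3,
    whose Hessian has the arrowhead pattern; also counts its own calls."""
    calls = {"n": 0}

    def grad(x):
        calls["n"] += 1
        xn = x[-1]
        g = [3.0 * xi ** 2 + 2.0 * xn * xi for xi in x[:-1]]
        g.append(sum(xi ** 2 for xi in x[:-1]) + 3.0 * xn ** 2)
        return g

    return grad, calls


def arrowhead_hessian(grad, x0, g0, delta=1e-6):
    """Approximate an arrowhead Hessian with two gradient evaluations."""
    n = len(x0)
    H = [[0.0] * n for _ in range(n)]
    # First direction d1 = delta * e^n: last column, and last row by symmetry.
    x = list(x0)
    x[-1] += delta
    g1 = grad(x)
    for j in range(n):
        H[j][n - 1] = H[n - 1][j] = (g1[j] - g0[j]) / delta
    # Second direction d2 = delta * (e^1 + ... + e^(n-1)): remaining diagonal.
    x = [xi + delta for xi in x0[:-1]] + [x0[-1]]
    g2 = grad(x)
    for j in range(n - 1):
        H[j][j] = (g2[j] - g0[j]) / delta
    return H


grad, calls = make_counting_gradient()
x0 = [1.0, 2.0, 3.0, 4.0]
g0 = grad(x0)                      # assumed available from the optimizer anyway
H = arrowhead_hessian(grad, x0, g0)
extra_evaluations = calls["n"] - 1
```

For this function the exact entries are $h_{ii} = 6x_i + 2x_n$ and $h_{in} = 2x_i$ for i < n, and $h_{nn} = 6x_n$, so the result can be checked by hand; `extra_evaluations` counts the gradient calls beyond the one at $x^0$.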

Suppose further that H has the following truncated arrowhead structure:

$$\begin{pmatrix}
0 & 0 & 0 & 0 & X \\
0 & 0 & 0 & 0 & X \\
0 & 0 & X & 0 & X \\
0 & 0 & 0 & X & X \\
X & X & X & X & X
\end{pmatrix}.$$

Note that the first, second and fifth equations of (2.1.1) are not used in the first evaluation. If g can be
evaluated component by component, more time could be saved by evaluating only the third and fourth
components of $g(x^0 + d^1)$. But it often happens that the components of g have common subexpressions that
make evaluating one component of g nearly as expensive as evaluating all of g, causing these apparent savings
to vanish. Also, many times g is available only as a user-written black box, and so it is not possible to specify
that only a subset of components be evaluated. These considerations lead us to assume henceforth that g
can be evaluated only as a whole, not component by component. However, it is easy to see how to adapt
the output of many of the available heuristics so that one can take advantage of component by component
evaluation when it is available.

Our general goal now is to approximate the Hessian of a given function F when H has a known sparsity
pattern, using the minimum possible number of gradient evaluations. This problem has been previously
considered by several authors, including Powell and Toint (1979), Coleman and Moré (1982) and Thapa
(1980), Section 5.1. In order for an approximation method to be practical it must be fast and it must be
numerically stable. Searching for a balance between these two competing goals has led previous researchers
to consider the problem under various restrictions on the form of equations (2.1.1). A systematic way of
classifying these restrictions is presented in Section 2.2, and a subclassification of the so-called direct methods
is presented in Section 2.3.

In order to better understand the examples in Sections 2.2 and 2.3 and to be able to analyze the
complexity of different classes of methods, it is helpful to have a graph representation of sparsity patterns.
Since the sparsity pattern of H is symmetric, a natural model to choose is the graph $\mathcal{G}(H)$ with node set
$N = \{1, 2, \ldots, n\}$ and edges $E = \{\, \{i, j\} \mid h_{ij} \text{ is not known to be } 0 \,\}$. Thus the symmetry of H corresponds
to the undirectedness of $\mathcal{G}(H)$. When drawing pictures of $\mathcal{G}(H)$, loops that stem from non-zero diagonal
entries in H will be suppressed. For example, the following sparsity pattern corresponds to the displayed
graph:

$$\begin{pmatrix}
X & X & 0 & 0 \\
X & X & X & 0 \\
0 & X & X & X \\
0 & 0 & X & X
\end{pmatrix}$$

[Figure: the graph of this pattern, the path 1 - 2 - 3 - 4.]


Conversely, any undirected graph (possibly with loops) clearly corresponds to the sparsity pattern of some
matrix H.
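The correspondence between a symmetric sparsity pattern and its graph can be sketched in a few lines. The representation of the pattern as a list of strings over "X" and "0", and the function name `sparsity_graph`, are assumptions made for illustration.

```python
def sparsity_graph(pattern):
    """Return (nodes, edges) of the graph of a symmetric sparsity pattern.

    `pattern` is a list of equal-length strings over {"X", "0"}.
    Nodes are numbered from 1 as in the text. Edges are unordered pairs,
    stored as frozensets; loops from diagonal entries are suppressed.
    """
    n = len(pattern)
    nodes = set(range(1, n + 1))
    edges = {
        frozenset((i + 1, j + 1))
        for i in range(n)
        for j in range(i + 1, n)       # strictly above the diagonal
        if pattern[i][j] == "X"
    }
    return nodes, edges


pattern = ["XX00", "XXX0", "0XXX", "00XX"]
nodes, edges = sparsity_graph(pattern)
```

For the tridiagonal pattern above, `edges` holds the three pairs {1, 2}, {2, 3} and {3, 4}.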

Using this graph model, it is proved in Section 2.4 that all the variations of the direct methods considered
in Section 2.3 are NP-Complete. Section 2.4 concludes with some positive results about heuristics for direct
methods to counterbalance the negative complexity results.

In order to be able to quantify the performance of heuristics, it would be useful to have an easily
computable lower bound on the number of evaluations needed. The number of evaluations needed when no
restrictions are placed on the difference directions is clearly a lower bound. Some results about this bound
and how to compute it efficiently are presented in Section 2.5.

This  chapter  of  the  thesis  concludes  with  Section  2.6,  which  points  out  the  unresolved  questions  in  the 
preceding  sections  and  suggests  areas  for  future  research. 


2.2.  Classifying  Approximation  Methods 

Denote $(g(x^0 + d^l) - g(x^0))^T$ by $\Delta^l$ and an approximation to $H(x)$ by $\hat{H}(x)$. An approximation method
is an algorithm that, when given F and the sparsity pattern of its Hessian, chooses fixed (independent of x)
difference directions $d^1, d^2, \ldots, d^k$ so that the nk equations

$$\hat{H}(x^0) \, d^l = \Delta^l, \qquad l = 1, 2, \ldots, k, \tag{2.2.1}$$

(which are just (2.1.1) re-written in the new notation) have a subsystem that can be uniquely solved to yield
$\hat{H}(x^0)$. When the i,j entry of the sparsity pattern is zero, $\hat{h}_{ij}$ ($= \hat{h}_{ji}$) is set to zero in (2.2.1). Because of
the assumed symmetry of the Hessian, variables $\hat{h}_{ij}$ and $\hat{h}_{ji}$ are identified. Many approximation methods
determine $\hat{h}_{ij}$ and $\hat{h}_{ji}$ as if they were different, which leads to over-determined linear systems when they
are identified. This is the reason why it is required that only a subsystem of (2.2.1) uniquely determine the
non-zero $\hat{h}_{ij}$'s. For example, the usual method that chooses $d^i = \delta_i e^i$, i = 1, 2, ..., n, is an approximation
method, where the unknown $\hat{h}_{ij}$, $i \ne j$, appears in both the jth equation for $d^i$ and the ith equation for $d^j$.
Deleting equation j for $d^i$ when j < i leads to a subsystem of (2.2.1) that can be uniquely solved for $\hat{H}(x^0)$.
The efficiency of an approximation method depends on two factors. First, since evaluating the gradient
is assumed to be expensive, the ideal approximation method would minimize k, the number of difference
directions used to form equations (2.2.1). Finding the k difference directions is a one-time cost; since
the sparsity pattern of H is independent of x, the difference directions are generated at the start of the
optimization and used at every iteration thereafter.

Second, once a set of difference directions has been found, equations (2.2.1) must be solved for $\hat{H}(x^0)$.
Solving a diagonal system of equations like (2.2.2) is both faster and more numerically stable than solving
a general system of equations. The solution cost is incurred whenever the Hessian is to be evaluated. If an
approximation method generates a set of equations (2.2.1) which are ill-conditioned, $\hat{H}$ may not be a good
approximation to H, and the convergence rate of an optimization procedure may suffer. Also, if solving
equations (2.2.1) is very difficult, then the approximation method may contribute an unacceptable overhead
to optimization.

Thus a smaller k may lead to fewer gradient evaluations per iteration, but also to spending more
time in solving equations per iteration. This trade-off has led researchers to the realization that practical
approximation methods may need to restrict the form of equations (2.2.1). Powell and Toint (1979) classify
approximation methods as follows. If, as in (2.2.2), the approximation method always gives rise to a diagonal
subsystem, it is called a direct method (since the Hessian can be solved for directly). The usual unit vector
method is a direct method. If the subsystem can always be permuted into triangular form, the approximation
method is called a substitution method (since the elements of the Hessian can be solved for by simple
substitution). Finally, if subsystems can arise which cannot be permuted into triangular form, the method
is called an elimination method (since some sort of Gaussian elimination must be applied to solve for $\hat{H}$).
Examples of substitution and elimination methods are exhibited in Section 2.5.

The next section will show that it is useful to break down these classes of methods further into subclasses.
Given a particular subclass S, our goal is to find a fast optimal method in S. That is, given a sparsity
pattern H, an optimal method in S generates difference directions $d^1, d^2, \ldots, d^k$ so that (1) the restrictions
of S are satisfied and (2) k is as small as possible for this H. To be practical, an optimal method must be
fast in finding the $d^l$; although this is a one-time cost, it should not contribute too much overhead to the
total optimization time. After the next section classifies the direct methods, Section 2.4 shows that in a
certain technical sense, there are no fast optimal direct methods for any of the classes of direct methods.


2.3.  Classifying  Direct  Methods 


This section considers direct methods for approximating Hessians. An entry of H which is not known
to be zero is called an unknown, and unknowns $h_{ij}$ and $h_{ji}$ are identified. The defining restriction of direct
methods turns out to be equivalent to a relationship between the sparsity pattern of H and the zero/non-zero
structure of the $d^l$. To see the equivalence, regard the non-zero components of each $d^l$ as specifying a subset
$S_l$ of the column indices of H. The set $S_l$ is called the lth group of columns of H. When two columns
belonging to $S_l$ both have an unknown in row i, we say that there is an overlap in $S_l$ in row i. For example,
consider the sparsity pattern:


The group corresponding to $d = (1, 1, 0)^T$ consists of columns { 1, 2 }, which overlap in row 3 but not in row 2.
The set of directions $\{ d^l \}$ chosen by a direct method gives rise to equations (2.2.1). In order for (2.2.1)
to have a diagonal subsystem that can be solved for all the unknowns in H, each unknown $h_{ij}$ (or $h_{ji}$) must
appear by itself in at least one equation, say the equation for row i in the set of equations arising from $d^l$.
This condition implies that $j \in S_l$ and that there are no other unknowns in row i of $S_l$ (otherwise, $h_{ij}$ would
not be the only unknown in the ith equation of $d^l$). Thus, for $h_{ij}$ to be determined directly, $S_l$ can have no
overlap in row i. The family of column subsets $\{ S_l \}$ corresponding to the directions $\{ d^l \}$ computed by a
direct method must therefore satisfy the

Direct Cover Property (DCP): For each unknown $h_{ij} = h_{ji}$ there must be an $S_l$ with either $j \in S_l$
and no overlap in row i, or $i \in S_l$ and no overlap in row j.

Conversely, given a sparsity pattern and a family $\{ S_l \}$ of column subsets satisfying (DCP), the set of
difference directions $d^l = \sum_{j \in S_l} \delta_j e^j$ can be defined, which corresponds to a direct method. Thus finding an
optimal direct method is equivalent to the purely combinatorial problem of finding a minimum cardinality
direct cover.
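Because (DCP) is purely combinatorial, it can be checked mechanically. The following is a minimal sketch, assuming the pattern is given as a list of strings over "X" and "0" and the groups as sets of 0-based column indices; the names `overlaps` and `is_direct_cover` are hypothetical.

```python
def overlaps(pattern, group, row):
    """True if two or more columns of `group` have an unknown in `row`."""
    return sum(pattern[row][c] == "X" for c in group) >= 2


def is_direct_cover(pattern, groups):
    """Check the Direct Cover Property for a symmetric sparsity pattern."""
    n = len(pattern)
    for i in range(n):
        for j in range(i, n):          # h_ij and h_ji are identified
            if pattern[i][j] != "X":
                continue
            covered = any(
                (j in S and not overlaps(pattern, S, i))
                or (i in S and not overlaps(pattern, S, j))
                for S in groups
            )
            if not covered:
                return False
    return True


# The 4 x 4 arrowhead pattern.
arrowhead = ["X00X", "0X0X", "00XX", "XXXX"]
two_groups_ok = is_direct_cover(arrowhead, [{3}, {0, 1, 2}])
one_group_ok = is_direct_cover(arrowhead, [{0, 1, 2, 3}])
```

The two groups {4} and {1, 2, 3} (in the text's 1-based numbering) form a direct cover of the arrowhead pattern, whereas putting all four columns into a single group does not.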

2.3.1.  Approximation  of  Sparse  Jacobians 

Direct approximation methods for Hessians have evolved out of previously studied methods for
approximating sparse Jacobians of functions $F: R^n \to R^m$ by finite-differencing. An example of the idea in this
case is that if a Jacobian has the sparsity pattern

$$\begin{pmatrix}
X & 0 & 0 & X & 0 \\
X & 0 & 0 & 0 & X \\
0 & X & 0 & 0 & X \\
0 & 0 & X & X & 0
\end{pmatrix},$$

then differencing F along $d^1 = (1, 1, 1, 0, 0)$ and $d^2 = (0, 0, 0, 1, 1)$ approximates the Jacobian in just two
function evaluations rather than five, since there is no overlap among the first three or last two columns.
There are also other applications of finding minimum groupings of non-overlapping columns; see, e.g.,
Dencker, Dürre and Heuft (1981) or Dürre and Fels (1980). The possibility of such reductions in function
evaluations has caused the problem of finding a minimum set of difference directions to be thoroughly
studied. Several heuristics for finding “good” sets of directions have been investigated (see Curtis, Powell
and Reid (1974), and Coleman and Moré (1981)). The computational complexity of finding an optimal set
of difference directions in a direct Jacobian approximation method has also been investigated, resulting in
the next theorem.

Theorem 2.3.1: (Coleman and Moré (1981), Theorem 3.3, and Newsam and Ramsdell (1982), Theorem 1)
Finding an optimal set of difference directions for directly approximating a Jacobian is NP-Complete. □
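Since finding an optimal grouping is NP-Complete, practical codes rely on heuristics. The following first-fit greedy grouping is a minimal sketch of the idea only; it is not claimed to reproduce the heuristic of any of the papers cited above, and the function name `greedy_column_groups` and the string representation of the pattern are assumptions.

```python
def greedy_column_groups(pattern):
    """Group columns so that no two columns in a group overlap in any row.

    `pattern` is a list of equal-length strings over {"X", "0"} (m rows,
    n columns). Each column is placed in the first existing group with
    which it has no overlap; otherwise it starts a new group.
    """
    m, n = len(pattern), len(pattern[0])
    col_rows = [{r for r in range(m) if pattern[r][c] == "X"} for c in range(n)]
    groups, used_rows = [], []
    for c in range(n):
        for group, rows in zip(groups, used_rows):
            if not (rows & col_rows[c]):   # no row shared with this group
                group.append(c)
                rows |= col_rows[c]        # in-place update of the group's rows
                break
        else:
            groups.append([c])
            used_rows.append(set(col_rows[c]))
    return groups


jacobian_pattern = ["X00X0", "X000X", "0X00X", "00XX0"]
groups = greedy_column_groups(jacobian_pattern)
```

On the 4 x 5 pattern above the sketch recovers the two groups used in the text, columns {1, 2, 3} and {4, 5} in 1-based numbering. A first-fit rule of this kind gives no guarantee of optimality in general.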

2.3.2.  Classification  by  Type  of  Overlapping 

The  Hessian  problem  is  significantly  more  difficult  than  the  Jacobian  problem  because  of  the  symmetry 
of  the  matrix.  Various  heuristic  approaches  to  finding  optimal  direct  covers  have  been  proposed.  We  review 
next  the  history  of  these  efforts,  and  then  propose  a  new  way  of  classifying  direct  methods. 

One obvious approach to approximating Hessians is simply to apply one of the Jacobian methods to
the symmetric sparsity pattern of the Hessian. Such an approach leads to families of subsets of columns
whose subsets have no overlap in any row. As long as every column is in some subset, such a family clearly
satisfies (DCP), so that any direct Jacobian approximation method immediately becomes a direct method
for Hessians. A direct cover which has no overlap in any row of any group is called a non-overlap direct
cover (NDC). The first NDC heuristic for Jacobians was proposed by Curtis, Powell and Reid (1974), and
it was later improved by Coleman and Moré (1981).

Powell and Toint (1979) recognized that a significant decrease in gradient evaluations can be achieved
by taking advantage of symmetry. For example, recall the arrowhead sparsity pattern (2.1.2). Since every
column overlaps with every other, any NDC must contain at least n groups (and of course n suffice). But, as
was shown in Section 2.1, if the first difference direction is $d^1 = \delta_n e^n$, then it does not matter if subsequent
columns overlap in row n since the unknowns in row n have already been determined by the first evaluation
and symmetry. Exploiting this observation reduces the number of gradient evaluations for this example from
n to two.

In general, if $S_1$ has no overlaps in any row and if $j \in S_1$, then by symmetry row j is completely
determined after the first evaluation, and so overlaps in row j can be ignored in later groups. Thus we
consider families of column subsets with the following property:

Sequential Overlap Property (SOP): Group $S_l$ can have an overlap in row i only if there is a k < l
with $i \in S_k$.

For an unknown $h_{ij}$, define the minimum index group of $h_{ij}$ to be $p = \min\{\, l \mid \text{either } i \text{ or } j \text{ belongs to group } l \,\}$.
When (SOP) holds, $h_{ij}$ must be the only unknown in its row in group p, so that (DCP) is
satisfied. Direct covers with (SOP) are called sequential overlap direct covers (SeqDC). Note that any
NDC algorithm that generates its groups sequentially can easily be converted into a SeqDC algorithm by
deleting the columns and corresponding rows of group $S_l$ before finding group $S_{l+1}$.
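The conversion just described can be sketched as follows. The greedy rule used to build each group, the ordering of columns by decreasing number of unknowns, and the names `greedy_seqdc` and `satisfies_sop` are assumptions made for illustration; any sequential NDC heuristic could take the place of the inner loop.

```python
def greedy_seqdc(pattern):
    """Build groups one at a time, deleting each finished group's columns
    and the corresponding rows before the next group is formed."""
    n = len(pattern)
    # Assumed ordering: columns with the most unknowns are tried first.
    remaining = sorted(range(n), key=lambda c: -sum(row[c] == "X" for row in pattern))
    live_rows = set(range(n))
    groups = []
    while remaining:
        group, used = [], set()
        for c in remaining:
            rows = {r for r in live_rows if pattern[r][c] == "X"}
            if not (rows & used):      # no overlap in any row not yet deleted
                group.append(c)
                used |= rows
        groups.append(group)
        remaining = [c for c in remaining if c not in group]
        live_rows -= set(group)        # rows of this group are now determined
    return groups


def satisfies_sop(pattern, groups):
    """Check the Sequential Overlap Property for an ordered list of groups."""
    earlier = set()
    for S in groups:
        for i in range(len(pattern)):
            overlap = sum(pattern[i][c] == "X" for c in S) >= 2
            if overlap and i not in earlier:
                return False
        earlier |= set(S)
    return True


arrowhead = ["X00X", "0X0X", "00XX", "XXXX"]
groups = greedy_seqdc(arrowhead)
```

On the 4 x 4 arrowhead pattern the sketch first selects the last column alone, deletes row and column 4, and then places the remaining three columns in a single group, matching the two evaluations found in Section 2.1. The order of the groups matters: the same two groups listed in the opposite order violate (SOP).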

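Powell and Toint (1979) showed that there are sparsity patterns for which an optimal SeqDC is not an
optimal direct cover. Their example is:

$$\begin{pmatrix}
X & X & X & X & 0 & 0 \\
X & X & X & 0 & X & 0 \\
X & X & X & 0 & 0 & X \\
X & 0 & 0 & X & 0 & 0 \\
0 & X & 0 & 0 & X & 0 \\
0 & 0 & X & 0 & 0 & X
\end{pmatrix} \tag{2.3.1}$$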

It is easy to see that any SeqDC for (2.3.1) requires at least four groups, but that { { 1, 5 }, { 2, 6 }, { 3, 4 } } is
a direct cover of size only three. Thapa (1980), Section 5.1, proposed a method that tries to take advantage
of such situations, and that produces a direct cover which may not satisfy (SOP). Direct covers that do not
necessarily satisfy any additional restrictions are called simultaneous overlap direct covers (SimDC);
any direct cover is a SimDC.

Note that NDC $\subset$ SeqDC $\subset$ SimDC and that these inclusions are strict, as shown by (2.1.2) and (2.3.1).
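The claims about the Powell and Toint pattern can be checked mechanically. The following is a minimal sketch, assuming the pattern is stored as a 0/1 matrix with 0-based indices; the function names `overlap_rows`, `is_direct_cover` and `satisfies_sop` are illustrative and do not come from the original text.

```python
from itertools import permutations

# Powell-Toint pattern; 1 marks an unknown, indices are 0-based here.
PT = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]

def overlap_rows(pattern, group):
    """Rows in which two or more columns of `group` hold an unknown."""
    return {r for r, row in enumerate(pattern)
            if sum(row[c] for c in group) > 1}

def is_direct_cover(pattern, groups):
    """True if every unknown h_ij is read off directly, as h_ij or as h_ji."""
    n = len(pattern)

    def direct(i, j):
        # Entry (i, j) is alone in row i of some group that contains column j.
        return any(j in g and i not in overlap_rows(pattern, g) for g in groups)

    return all(direct(i, j) or direct(j, i)
               for i in range(n) for j in range(n) if pattern[i][j])

def satisfies_sop(pattern, ordered_groups):
    """True if each group overlaps only in rows indexed by earlier groups."""
    seen = set()
    for g in ordered_groups:
        if not overlap_rows(pattern, g) <= seen:
            return False
        seen |= set(g)
    return True

cover = [{0, 4}, {1, 5}, {2, 3}]   # { {1,5}, {2,6}, {3,4} } in 1-based numbering
print(is_direct_cover(PT, cover))                                      # True
print(any(satisfies_sop(PT, order) for order in permutations(cover)))  # False
```

Each of the three groups has an overlap in some row, so whichever group is taken first violates (SOP); this is why the cover is a SimDC but not a SeqDC under any ordering.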


2.3.3.  Classification  by  Partitioning 


All of the heuristic methods mentioned above produce partitioned direct covers, that is, every column
belongs to exactly one group. Even when there are no zero columns, not every column need belong to some
group in a valid direct cover. Consider


{ {2} } is a direct cover (in fact, an NDC) of minimum size, yet column one belongs to no group. But when
$h_{ii}$ is an unknown, column $i$ must belong to some group in order for $h_{ii}$ to be determined. When $H$ arises
from unconstrained optimization, $h_{ii}$ is usually non-zero for all $i$, which implies that all columns must be
in some group. Thus, unless otherwise stated, henceforth it is assumed that $h_{ii}$ is an unknown for all $i$, so
that only direct covers containing every column need to be considered.

Now consider a SeqDC in which columns may appear in more than one group. If every occurrence of
every column in a group other than its smallest index group is deleted, (SOP) holds since each unknown
is determined by its minimum index group. That is, since each unknown is determined by a column in its
minimum index group, any later occurrences of that column are superfluous, and there can be no occurrences
of that column in an earlier group by the definition of minimum index group. Thus, for SeqDCs (and also
for NDCs since NDC $\subset$ SeqDC), it suffices to consider only partitioned direct covers.

Unfortunately, the same is not true for SimDCs. Consider the sparsity pattern:

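$$
\begin{pmatrix}
\times & 0 & 0 & \times & \times & \times & \times \\
0 & \times & \times & 0 & \times & \times & 0 \\
0 & \times & \times & \times & 0 & 0 & \times \\
\times & 0 & \times & \times & \times & 0 & 0 \\
\times & \times & 0 & \times & \times & 0 & 0 \\
\times & \times & 0 & 0 & 0 & \times & 0 \\
\times & 0 & \times & 0 & 0 & 0 & \times
\end{pmatrix}
\tag{2.3.2}
$$

Laborious calculations verify that any partitioned simultaneous direct cover (PSimDC) must use
at least five groups, whereas { {1, 2}, {1, 3}, {4, 6}, {5, 7} } is a general simultaneous direct cover
(GSimDC) which uses only four groups. This is the smallest possible such example in terms of number of
columns. Eisenstat (referenced in Coleman and Moré (1982), equation (2.1)) has discovered an infinite class
of such examples.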
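The four-group general cover of the 7-column pattern can be verified directly, and the check also shows why column 1 needs both of its groups. This is a small sketch under the assumption that a cover is a list of possibly overlapping sets of 0-based column indices; `undetermined` is an illustrative helper, not a routine from the text. It does not attempt the "laborious" proof that five groups are needed for a partitioned cover.

```python
# The 7-column pattern above, 0-based; 1 marks an unknown.
P7 = [
    [1, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1],
]

def undetermined(pattern, groups):
    """Unknowns (i, j), i <= j, that no group yields directly as h_ij or h_ji."""
    n = len(pattern)

    def direct(i, j):
        # h_ij can be read from a group containing column j when no other
        # column of that group has an unknown in row i.
        return any(j in g and all(c == j or not pattern[i][c] for c in g)
                   for g in groups)

    return [(i, j) for i in range(n) for j in range(i, n)
            if pattern[i][j] and not (direct(i, j) or direct(j, i))]

general = [{0, 1}, {0, 2}, {3, 5}, {4, 6}]   # { {1,2}, {1,3}, {4,6}, {5,7} }
print(undetermined(P7, general))             # [] : every unknown is determined

# Dropping either occurrence of column 1 (index 0) leaves unknowns uncovered.
print(undetermined(P7, [{0, 1}, {2}, {3, 5}, {4, 6}]))
print(undetermined(P7, [{1}, {0, 2}, {3, 5}, {4, 6}]))
```

The two partitioned variants fail on different entries of row 1, which illustrates how the repeated column lets each of its groups resolve the entries the other cannot.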

These two classifications give four distinct subclasses of direct covers: NDC, SeqDC, PSimDC and
GSimDC. The next section shows that finding an optimal member of each of the four classes is NP-Complete.

2.4.  The  Complexity  of  Direct  Methods 

The  main  purpose  of  this  section  is  to  prove  the  following  four  theorems: 

Theorem  2.4.1:  Finding  an  optimal  NDC  is  NP-Complete. 

Theorem  2.4.2:  Finding  an  optimal  SeqDC  is  NP-Complete. 

Theorem  2.4.3:  Finding  an  optimal  PSimDC  is  NP-Complete. 

Theorem 2.4.4: Finding an optimal GSimDC is NP-Complete.

Recall from Section 1.2 that to prove that a problem X is NP-Complete it is necessary to reduce a
known NP-Complete problem to problem X. We shall use three known NP-Complete problems. The first
is the direct Jacobian approximation problem already discussed in Section 2.3. The second is the

3-Satisfiability Problem (3SAT): Let $u_1, u_2, \ldots, u_n$ be a set of atoms, with the corresponding set of
literals $L = \{ u_1, \bar u_1, u_2, \bar u_2, \ldots, u_n, \bar u_n \}$. Let $C = \{ C_1, C_2, \ldots, C_m \}$ be a set of 3-clauses drawn
from $L$, that is, each $C_s \subset L$ and $|C_s| = 3$. Is there a truth assignment $\tau : \{ u_1, \ldots, u_n \} \to \{ \text{true}, \text{false} \}$
such that each $C_s$ contains at least one $u_i$ with $\tau(u_i) = \text{true}$ or at least one $\bar u_i$ with
$\tau(u_i) = \text{false}$?

The  set  of  clauses  is  an  abstraction  of  a  logical  formula;  imagine  the  clauses  as  parenthesized  subformulae 
whose  literals  are  connected  by  ‘or’,  with  all  the  clauses  connected  by  ‘and’.  Then  a  satisfying  truth 
assignment  makes  the  whole  formula  true. 
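To make the definition concrete, here is a brute-force satisfiability check. It is only an illustration of the problem statement, assuming the common encoding in which literal `+i` stands for $u_i$ and `-i` for its negation; the function name is hypothetical and the exponential search is not meant as a practical method.

```python
from itertools import product

def satisfying_assignment(n, clauses):
    """Return a satisfying truth assignment as a tuple of booleans, or None.

    Atoms are numbered 1..n; literal +i stands for u_i, -i for its negation.
    """
    for tau in product([False, True], repeat=n):
        # A clause is satisfied when at least one of its literals is true.
        if all(any(tau[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return tau
    return None

print(satisfying_assignment(3, [(1, 2, 3), (-1, -2, -3)]))
```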

The  third  problem  is  the 

3-Color Graph Coloring Problem (3GCP): Given a graph $\mathcal{G} = (V, E)$, does there exist a function $f : V \to \{ 1, 2, 3 \}$
such that $f(u) \ne f(v)$ whenever $\{ u, v \} \in E$?

This problem remains NP-Complete even when $\mathcal{G}$ is restricted to be planar (see Garey and Johnson
(1979), Theorem 4.2). Note that 3GCP is a restricted form of the classic problem of finding the chromatic
number of a graph. The NP-Completeness of the direct Jacobian approximation problem was stated in
Theorem 2.3.1. The NP-Completeness proofs of 3SAT and 3GCP are referenced in Garey and Johnson
(1979), Problems [LO2] and [GT4] respectively.
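The 3GCP decision problem can likewise be stated as a few lines of exhaustive search. This is a sketch for small graphs only, with an edge-list representation chosen here for illustration.

```python
from itertools import product

def three_coloring(vertices, edges):
    """Return a proper 3-coloring as a dict vertex -> color, or None."""
    for colors in product((1, 2, 3), repeat=len(vertices)):
        f = dict(zip(vertices, colors))
        # Proper: the endpoints of every edge receive different colors.
        if all(f[u] != f[v] for u, v in edges):
            return f
    return None

triangle = [(0, 1), (1, 2), (0, 2)]
print(three_coloring([0, 1, 2], triangle) is not None)   # True
```

The complete graph on four vertices is the smallest graph for which the function returns `None`.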

A graph operation that is needed in two of the proofs is the notion of edge replacement. Given a
graph $\mathcal{G}$ and edge $e = \{ u, v \}$ in $\mathcal{G}$, and graph $K$ with two distinguished vertices $s$ and $t$ called terminals,
the result of replacing edge $e$ of $\mathcal{G}$ by $K$ is obtained by removing $e$ from $\mathcal{G}$ and identifying $u$ with $s$ and $v$
with $t$. For example, if

[figure: a graph $\mathcal{G}$ with edge $e$, and a graph $K$ with terminals $s$ and $t$]

then replacing $e$ with $K$ yields the graph:

[figure: the result of the edge replacement]

All the proofs depend on the equivalence between Hessian sparsity patterns and undirected graphs
mentioned in Section 2.1. Throughout this section, $\mathcal{G}$ denotes the graph associated with $H$.
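The edge replacement operation defined above can be sketched as follows. The representation (edges as pairs, non-terminal vertices of $K$ renamed with a caller-supplied `tag` so that several copies of $K$ do not collide) is an assumption made for this example.

```python
def replace_edge(g_edges, e, k_edges, s, t, tag):
    """Replace edge e = (u, v) of a graph by a copy of K with terminals s, t.

    Terminal s is identified with u and terminal t with v; every other
    vertex x of K becomes the fresh vertex (tag, x).
    """
    u, v = e

    def rename(x):
        if x == s:
            return u
        if x == t:
            return v
        return (tag, x)

    kept = {frozenset(edge) for edge in g_edges} - {frozenset(e)}
    added = {frozenset((rename(a), rename(b))) for a, b in k_edges}
    return kept | added

triangle = [(1, 2), (2, 3), (1, 3)]
path = [("s", "a"), ("a", "t")]          # K: the path s - a - t
result = replace_edge(triangle, (1, 2), path, "s", "t", tag="e12")
print(len(result))                        # 4: the triangle has become a 4-cycle
```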

If a polynomial algorithm exists for finding an optimal direct cover of any type, it must in particular
work on sparsity patterns whose diagonal entries are all unknowns. Thus it can be assumed without loss
of generality that every column index belongs to at least one group. A direct cover then associates with
each vertex of $\mathcal{G}$ the index of the group (indices of groups in the case of GSimDCs) to which its column
belongs. Such an association of integers to vertices is a graph coloring, whose type depends on the nature of
the associated direct cover. In each of the four cases, we shall establish what sort of coloring is involved and
show that the problem is NP-Complete. Note that in every case two adjacent vertices $i$ and $j$ cannot be the
same color because then neither $h_{ii}$ nor $h_{jj}$ could be determined directly due to overlap with $h_{ij}$. Thus the
generalized graph coloring must be a usual graph coloring as well.

2.4.1.  The  Complexity  of  Finding  Optimal  NDCs

We  shall  give  two  proofs  of  Theorem  2.4.1,  the  first  very  general  and  complicated,  the  second  quite 
specific  and  short. 

In an NDC, two columns in the same group, i.e. two vertices of the same color, cannot overlap. Column
$i$ overlaps column $j$ if $h_{ki}$ and $h_{kj}$ are both unknowns for some $k$, i.e., if vertices $i$ and $j$ are both adjacent to vertex
$k$. Thus in the coloring of $\mathcal{G}$, no two vertices of the same color can have a common neighbor. Conversely,
given such a coloring of $\mathcal{G}$, clearly no two columns in the associated groups can overlap.

If distance from vertex $i$ to vertex $j$ in the graph is measured by "minimum number of edges in any
path between $i$ and $j$", then any two vertices of the same color must be more than two units apart. In the
usual Graph Coloring Problem, any two vertices of the same color must be more than one unit apart. Then
a common generalization is a proper distance-$k$ coloring of a graph $\mathcal{G}$, which is a partition of the vertices
of $\mathcal{G}$ into classes (colors) so that every pair of vertices of the same color is more than $k$ units apart. The
associated optimization problem is the

Distance-$k$ Graph Coloring Problem (D$k$GCP): Given a graph $\mathcal{G}$, find a proper distance-$k$ coloring
of $\mathcal{G}$ in the minimum possible number of colors.

The usual Graph Coloring Problem (GCP) is D1GCP, and the optimal NDC problem is equivalent to
D2GCP. We shall use this equivalence to show that the optimal NDC problem is NP-Complete by showing
that D2GCP is NP-Complete; in fact, we shall show the stronger result that D$k$GCP is NP-Complete for
any fixed $k \ge 2$.
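A proper distance-$k$ coloring is easy to verify with a bounded breadth-first search. The sketch below assumes an adjacency-set representation; `pattern_to_adj` shows one way to obtain the graph from a symmetric 0/1 sparsity pattern, so that the case $k = 2$ tests whether a partition of the columns is an NDC. All names are illustrative.

```python
from collections import deque

def within_distance(adj, source, k):
    """Vertices other than `source` reachable in at most k edges (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == k:          # do not expand beyond k edges
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    del dist[source]
    return set(dist)

def is_distance_k_coloring(adj, color, k):
    """True if same-colored vertices are always more than k edges apart."""
    return all(color[v] != color[w]
               for v in adj for w in within_distance(adj, v, k))

def pattern_to_adj(pattern):
    """Graph of a symmetric sparsity pattern: i ~ j iff h_ij is an off-diagonal unknown."""
    n = len(pattern)
    return {i: {j for j in range(n) if j != i and pattern[i][j]}
            for i in range(n)}

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(is_distance_k_coloring(path, {0: 1, 1: 2, 2: 1, 3: 2}, 1))   # True
print(is_distance_k_coloring(path, {0: 1, 1: 2, 2: 1, 3: 2}, 2))   # False
```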

To show that D$k$GCP is NP-Complete, 3SAT will be encoded into it. To facilitate the encoding, D$k$GCP
must be recast as a decision problem. As is standard with optimization problems, D$k$GCP is re-phrased to
"Is there a distance-$k$ coloring that uses $p$ or fewer colors?" In a slight abuse of notation, let D$k$GCP refer
to both the optimization problem and the related decision problem. Our encoding is a generalization of the
one found in Karp's original proof (1972) of the NP-Completeness of GCP. The first proof of Theorem 2.4.1
requires the exclusion of the case in which a clause contains both an atom and its negation. But such clauses
are always trivially satisfied, and so henceforth "3SAT" will mean "3-Satisfiability without trivial clauses".
Also, it is assumed (without loss of generality) that $n \ge 4$. Then for any clause $C_s$, there is an atom $u_i$ with
$u_i \notin C_s$ and $\bar u_i \notin C_s$.

Given a 3SAT problem $P$, we construct from it a decision problem on a graph $\mathcal{G}_k(P)$. If $P$ has atoms
$u_1, u_2, \ldots, u_n$ and clauses $C_1, C_2, \ldots, C_m$, let $h = [k/2]$ and $p = 2nk + m(k - 1)$. Let $V$ and $E$ denote the
vertices and edges of $\mathcal{G}_k(P)$, and define them by:


[display defining $V$ and $E$; the edge list is not recoverable, and only the following parts are legible]

$V$ consists of the literal vertices $u_i$, $\bar u_i$, the false vertices $F_i^r$ and true vertices $T_i^r$
($i = 1, \ldots, n$, $r = 1, \ldots, k$), the clause vertices $C_s^r$ ($r = 0, \ldots, k - 1$) and the intermediate
vertices $I_s^r$ ($r = 1, \ldots, k - 1$), $s = 1, \ldots, m$.

The annotations attached to the edge families of $E$ read: $u_i$, $\bar u_i$ different colors; all $F$'s, $T$'s
different colors; $C_s^0$ different color than its literals; $u_i$, $\bar u_i$ can only be $F_i^1$ or $T_i^1$;
$C_s^r$, $r > 0$, different from each other and $F$'s and $T$'s; $C_s^0$ can only be $F^1$ colors of its literals.

Note that $\mathcal{G}_k(P)$ is considered only for $k \ge 2$, implying that $h + 1 \le k$, so that $h + 1$ makes sense as a
superscript for the $F$'s and the $T$'s. The global structure of $\mathcal{G}_k(P)$ looks like

[figure: global structure of $\mathcal{G}_k(P)$]


Three propositions about the structure of a proper distance-$k$ coloring of $\mathcal{G}_k(P)$ are needed for the proof.

Proposition 2.4.5: The vertices $F_i^r$, $T_i^r$ ($r = 1, \ldots, k$) and $C_s^q$ ($q = 1, \ldots, k - 1$), $i = 1, \ldots, n$, $s = 1, \ldots, m$, must all have
different colors, thus using up all $p$ colors.

Proof: Consider the length $k$ paths

[display of paths not recoverable]

which demonstrate that all $F$'s and $T$'s must be different colors. Now consider the length $k - 1$ paths

[display of paths not recoverable]   (2.4.1)

which show that no $C_s^q$, $0 < q \le h$, can be any $F_i^r$ or $T_i^r$ color, $r > h$. The length at most $k - 1$ paths

[display of paths not recoverable]

show that no $C_s^q$, $h < q < k$, can be any $F_i^r$ or $T_i^r$ color, $r > h$. Let $l$ be an index such that $u_l \notin C_s$ and
$\bar u_l \notin C_s$, and consider the length $k - 1$ paths

[display of paths not recoverable; separate cases for $k$ odd and $k$ even]   (2.4.2)

which show that no $C_s^q$, $0 < q \le h$, can be any $T_i^r$ color, $r \le h$. The length $k$ paths

[display of paths not recoverable; separate cases for $k$ odd and $k$ even]   (2.4.3)

show that no $C_s^q$, $0 < q \le h$, can be any $F_i^r$ color, $r \le h$. The length $k$ paths

[display of paths not recoverable]

show that no $C_s^q$, $h < q < k$, can be any $F_i^r$ or $T_i^r$ color, $r \le h$. The length $k - 1$ paths

[display of paths not recoverable; separate cases for $k$ odd and $k$ even]   (2.4.4)

show that no $C_s^q$ can be the same color as any $C_t^r$, $0 < q, r \le h$. Finally, the length $k - 1$ path

[display of path not recoverable]   (2.4.5)

shows that no $C_s^q$ can be the same color as any $C_t^r$, $h < q, r < k$. □

Since $F_i^r$, $T_i^r$ and $C_s^q$, $q > 0$, use up all the colors, the colors are subsequently referred to by these vertex
names.

Proposition 2.4.6: Vertices $u_i$ and $\bar u_i$ must be colored $F_i^1$ and $T_i^1$ in some order, $i = 1, \ldots, n$.

Proof: Let $j \ne i$ and consider the length $k$ paths

[display of paths not recoverable]   (2.4.6)

which show that $u_i$ and $\bar u_i$ cannot be any color other than $F_i^1$ and $T_i^1$. Also, $u_i$ and $\bar u_i$ certainly cannot be
the same color. □

Thus a proper distance-$k$ coloring of $\mathcal{G}_k(P)$ induces a truth assignment on the literals.

Proposition 2.4.7: If the literals in clause $C_s$ have indices $a$, $b$, and $c$, then $C_s^0$ must be colored $F_a^1$, $F_b^1$,
or $F_c^1$, $s = 1, \ldots, m$.

Proof: Note that $C_s^0$ can be added to the beginning of the paths in (2.4.1), (2.4.2), (2.4.3), (2.4.4) and (2.4.5),
thus excluding all colors except the $F_i^1$ from $C_s^0$. If $l$ is an index such that $u_l \notin C_s$, $\bar u_l \notin C_s$, then the edge
$F - F$ can be dropped from (2.4.3) and $C_s^0$ can be added to the beginning to show that $C_s^0$ cannot be any
$F_l^1$ color either. □

Now the NP-Completeness theorem can be stated and proved, which will then immediately imply that
finding an optimal NDC is NP-Complete. This theorem is a symmetric version of a result in Section 3 of
Coleman and Moré (1981).

Theorem 2.4.8: For fixed $k \ge 2$, D$k$GCP is NP-Complete.

Proof: Since the size of $\mathcal{G}_k(P)$ is a polynomial in $m$ and $n$, it is clear that the above reduction of 3SAT to
D$k$GCP can be carried out in polynomial time. It must be shown that there is a satisfying truth assignment
for the 3SAT problem $P$ if and only if the graph $\mathcal{G}_k(P)$ has a proper distance-$k$ coloring in $p$ or fewer colors.

First suppose that $\mathcal{G}_k(P)$ is properly distance-$k$ colored. If $l_a$, $l_b$ and $l_c$ are the literals contained in $C_s$,
then the length $k$ path

$$ C_s^0 - I_s^1 - \cdots - I_s^{k-1} - l_a $$

shows that $C_s^0$ cannot be the same color as any of $l_a$, $l_b$ or $l_c$. But $C_s^0$ must be colored $F_a^1$, $F_b^1$ or $F_c^1$ by
Proposition 2.4.7. By Proposition 2.4.6, each $l_i$ is colored either $F_i^1$ or $T_i^1$, so that each clause must contain
at least one true literal. Thus, under the truth assignment induced by the proper coloring, the clauses are
satisfiable.

Now it suffices to show that $\mathcal{G}_k(P)$ can always be colored in $p$ or fewer colors if $P$ is satisfiable. Let $\tau$
be a satisfying truth assignment for $C_1, C_2, \ldots, C_m$. First color the $F_i^r$'s, $T_i^r$'s and $C_s^r$'s, $r > 0$, as decreed
by Proposition 2.4.5. Color $u_i$ with $T_i^1$ if $\tau(u_i) = \text{true}$, color $u_i$ with $F_i^1$ otherwise; color $\bar u_i$ with the
complementary color. Each $C_s$ has at least one true literal, say $l_a$. Color $C_s^0$ with color $F_a^1$. Finally, color
$I_s^r$ with $C_{s+1}^r$, $r = 1, \ldots, k - 1$, where the subscript on $C_{s+1}^r$ is interpreted modulo $m$.

We now show that this coloring is proper. The colors $F_i^r$, $T_i^r$, $1 < r \le k$, each appear on only one
vertex and so are proper. Color $C_{s+1}^r$ appears on exactly two vertices, itself and $I_s^r$. A shortest possible path
between these vertices in $\mathcal{G}_k(P)$ is

[display of path not recoverable]

and is of length $k + 1$. It is a shortest path because at least $k$ edges must be used to get from the $I$ layer to the
$C$ layer, and one extra edge must be used to get from an $F$ or a $T$ to a $C$. Also, any alternative path between
these vertices that goes through a $C^0$ has at least $k + 2h$ edges because of the difference in subscripts, and
because the $C^r$'s do not interconnect for $r \le h$; thus color $C_{s+1}^r$ is proper. Color $T_i^1$ also appears on exactly
two vertices, itself and one of $u_i$ or $\bar u_i$. A shortest possible path between these vertices is the third one in
(2.4.6) with $T_i^1$ added at the end. For $j \ne i$, at least $k$ edges must be used to get from the $u$ layer to the
$T^1$ layer, and an extra edge is necessary to go from an $i$ vertex to a $j$ vertex. Any other path between these
vertices through the $I$'s uses at least $k + 2h - 1$ edges, and so $T_i^1$ is proper. Finally, $F_i^1$ can appear in three
places: on itself, on $u_i$ or $\bar u_i$, and on any number of $C_s^0$'s whose clauses contain either $u_i$ or $\bar u_i$. As with $u_i$ or
$\bar u_i$ and $T_i^1$ above, $u_i$ or $\bar u_i$ and $F_i^1$ do not cause a conflict. Some shortest possible paths between $F_i^1$ and any
$C_s^0$ are those in (2.4.3) with $C_s^0$ added to the beginning. Again, at least $k$ edges are necessary to go from the
$C^0$ layer to the $F^1$ layer, and an extra edge is necessary to go from an $l$ vertex to an $i$ vertex. Any other
path between these vertices through the $I$'s uses at least $2k$ edges, so that no $C_s^0$, $F_i^1$ pair causes a conflict.
Between a $u_i$ or $\bar u_i$, and a $C_s^0$, some shortest possible paths are

[display of two paths not recoverable]

of lengths $k$ and $k + 1$ respectively. The first cannot exist because of the truth assignment and because there
are no trivial clauses. Once again, the second must use $k$ edges going from the $u$ layer to the $C^0$ layer, and an extra
edge going from an $F$ or a $T$ to a $C$, so that no $u_i$ or $\bar u_i$, $C_s^0$ pair conflicts. Finally, a shortest possible path
between $C_t^0$ and $C_s^0$ is (2.4.4) with $C_t^0$ added to the beginning and $C_s^0$ added to the end, of length $k + 1$. Any
other path between these vertices through the $I$'s uses at least $2k$ edges, so that no $F_i^1$ color conflicts. Thus
the coloring is proper, and the theorem is proved. □

Second Proof of 2.4.1: This proof is due to Hoffman (1982). The direct Jacobian approximation problem is
reduced to finding an optimal NDC.

Given an $m \times n$ Jacobian sparsity pattern $A$, consider the symmetric $(m + n) \times (m + n)$ matrix

$$ H = \begin{pmatrix} J & A \\ A^T & I \end{pmatrix} $$

where $J$ is the $m \times m$ matrix of all ones, and $I$ is the $n \times n$ identity (these are the block sizes that the
argument below requires). Suppose that there is a polynomial
algorithm for finding optimal NDCs, and apply it to $H$. Since each of the first $m$ columns of $H$ overlaps
with all other columns, each of the first $m$ columns must appear by itself; but finding an optimal NDC for
$H$ then essentially reduces to finding a minimum partition of the last $n$ columns of $H$ into non-overlapping
groups. However, such a minimum partition solves the direct Jacobian approximation problem, which is
NP-Complete. Thus finding an optimal NDC must be NP-Complete as well. □
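The key step of this reduction can be illustrated numerically. The sketch below assumes the block form $H = \begin{pmatrix} J & A \\ A^T & I \end{pmatrix}$ with $J$ the $m \times m$ all-ones block and $I$ the $n \times n$ identity, which is the arrangement under which the first $m$ columns overlap everything and the last $n$ columns overlap exactly when the corresponding columns of $A$ do. The function names and the small pattern `A` are made up for the example, and `A` is assumed to have no zero columns.

```python
def hoffman_pattern(a):
    """Build the (m+n) x (m+n) symmetric pattern [[J, A], [A^T, I]] from an m x n pattern."""
    m, n = len(a), len(a[0])
    top = [[1] * m + list(a[i]) for i in range(m)]
    bottom = [[a[i][j] for i in range(m)] + [1 if c == j else 0 for c in range(n)]
              for j in range(n)]
    return top + bottom

def columns_overlap(pattern, c1, c2):
    """True if some row holds an unknown in both columns."""
    return any(row[c1] and row[c2] for row in pattern)

A = [[1, 0, 1],
     [0, 1, 1]]            # hypothetical 2 x 3 Jacobian pattern
H = hoffman_pattern(A)
m, n = 2, 3

# Each of the first m columns overlaps every other column of H.
print(all(columns_overlap(H, i, c) for i in range(m) for c in range(m + n) if c != i))

# The last n columns of H overlap exactly when the matching columns of A do.
print(all(columns_overlap(H, m + j, m + k) == columns_overlap(A, j, k)
          for j in range(n) for k in range(j + 1, n)))
```

Under these assumptions both checks hold for the sample pattern, so a non-overlapping partition of the last $n$ columns of $H$ is the same thing as a partition of the columns of $A$ into structurally orthogonal groups.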


2.4.2.  The  Complexity  of  Finding  Optimal  SeqDCs 

We now consider the sort of coloring of $\mathcal{G}$ induced by a SeqDC. The first color must be non-overlapping;
hence, as in the NDC case, no two color 1 vertices can have a common neighbor. Now, since overlap in
the group 1 rows no longer matters, the row and column indices in group 1 can be deleted from $H$, and
the group 2 columns can have no overlap in the reduced $H$. In graph terms, the reduction of the matrix
corresponds to deleting the color 1 vertices (and their incident edges) from $\mathcal{G}$ and requiring that the color 2
vertices have no common neighbors in the reduced $\mathcal{G}$. The color 2 vertices are then deleted from the graph,
and so on. Thus a SeqDC with $k$ groups is equivalent to a


Sequential $k$-Coloring: A sequential $k$-coloring of a graph $\mathcal{G}$ is a function $f : V \to \{ 1, 2, \ldots, k \}$ such
that no two vertices $u$ and $v$ with $f(u) = f(v) = l$ have a common neighbor in the graph obtained
by deleting all vertices $w$ with $f(w) < l$ from $\mathcal{G}$.


We shall show that it is NP-Complete to decide whether a graph $\mathcal{G}$ has a sequential 3-coloring.
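The definition of a sequential $k$-coloring translates almost literally into a checker. The sketch below assumes an adjacency-set representation and also enforces that the coloring is proper in the usual sense, as the general remarks at the start of this section require; the function name is illustrative.

```python
def is_sequential_coloring(adj, f):
    """Check a proper coloring f against the sequential common-neighbor rule."""
    if any(f[u] == f[v] for u in adj for v in adj[u]):
        return False                              # not even a usual coloring
    for level in sorted(set(f.values())):
        # Delete every vertex with a smaller color, then compare same-colored pairs.
        alive = {v for v in adj if f[v] >= level}
        same = [v for v in alive if f[v] == level]
        for idx, u in enumerate(same):
            for v in same[idx + 1:]:
                if adj[u] & adj[v] & alive:       # surviving common neighbor
                    return False
    return True

path = {0: {1}, 1: {0, 2}, 2: {1}}
print(is_sequential_coloring(path, {0: 2, 1: 1, 2: 2}))   # True
print(is_sequential_coloring(path, {0: 1, 1: 2, 2: 1}))   # False
```

On the three-vertex path the order of the colors matters: coloring the middle vertex first removes the common neighbor of the two end vertices, whereas coloring the end vertices first does not.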

Proof of Theorem 2.4.2: This proof is due to Stockmeyer (1982). We shall reduce 3SAT to the problem of
deciding whether $\mathcal{G}$ has a sequential 3-coloring. By the equivalence between SeqDCs and sequential graph
coloring, the reduction will show that finding an optimal SeqDC is also NP-Complete.

Given an instance $P$ of 3SAT, a graph $\mathcal{G}$ will be constructed such that $P$ has a satisfying truth assignment
if and only if there is a sequential 3-coloring of $\mathcal{G}$. For each atom $u$ of $P$, make a 6-cycle in $\mathcal{G}$ with two
adjacent vertices labelled $u$ and $\bar u$, as follows:

[figure: 6-cycle with adjacent vertices $u$ and $\bar u$]


By exhaustive enumeration it can be verified that the only two possible ways to 3-color the above graph
properly and sequentially are:

[figure: the two colorings of the 6-cycle]   (2.4.7)

Connect these 6-cycles according to the clauses of $P$ as follows. For each clause $C_j = \{ l_1, l_2, l_3 \}$, add
the nodes and edges

[figure: clause subgraph attached to $l_1$, $l_2$ and $l_3$]

to $\mathcal{G}$, where $l_1$, $l_2$ and $l_3$ are the literal vertices on the 6-cycles.

For example, if $P$ has four atoms, and clauses $\{ \bar u_1, u_2, u_3 \}$ and $\{ u_2, u_3, u_4 \}$, then the constructed $\mathcal{G}$ is:

[figure: the constructed graph]

First note that by (2.4.7), the neighborhood of a literal $l$ that is in some clause and is not colored 1 must
look like:

[figure: the two possible neighborhoods, with vertices $x$, $y$ and $z$]   (2.4.8)

In the second case, vertex $x$ cannot be colored any of 1, 2 or 3 in a proper sequential 3-coloring, and so
every literal that appears in some clause must be colored 1 or 2. In the first case of (2.4.8), vertex $y$ must
be colored 3, and hence vertex $z$ must be colored 1.

Suppose that $\mathcal{G}$ has been properly sequentially 3-colored and that all three literals in a clause are colored
2. Then by the above remarks about (2.4.8), the clause vertices must be partly colored as follows:

[figure: partial coloring of the clause subgraph]

The only remaining vertices that could be colored 1 are $x$ and $y$, and at most one of them can be colored
1. But then deleting the color 1 vertices leaves a path of length at least five (the darker vertices if $x$, say, is
colored 1) which cannot be sequentially 2-colored. Thus every clause must have at least one literal colored
1. If a truth assignment is associated with the coloring by setting atom $u_i$ true if vertex $u_i$ is colored 1,
and false otherwise, then the existence of the sequential 3-coloring of $\mathcal{G}$ implies that $P$ has a satisfying truth
assignment.

Conversely, suppose that $P$ has a satisfying truth assignment $\tau$. Color literal vertex $l$ of $\mathcal{G}$ with 1 if
$\tau(l) = \text{true}$, and with 2 otherwise. Arbitrarily extend the coloring as in (2.4.7) to the rest of the atomic
6-cycles. Because of the asymmetry of the clause subgraphs, there are five cases to consider in showing how
to color the clause subgraphs, depending on which subset of the literals is true. The five cases and their
colorings are:

[figure: the five cases and their colorings of the clause subgraph]

The only non-trivial case to check in verifying that this coloring is a proper and sequential 3-coloring of $\mathcal{G}$
is in the neighborhood of a false literal $l$ that is in many clauses. By (2.4.7) and (2.4.8) it must look like:

[figure: neighborhood of a false literal]

The correctness of the coloring is easily seen. Thus the existence of a satisfying truth assignment for $P$
implies the existence of a proper sequential 3-coloring for $\mathcal{G}$. □

2.4.3.  The  Complexity  of  Finding  Optimal  PSimDCs

We now consider the sort of coloring a PSimDC gives rise to. By (DCP), for each unknown $h_{ij}$ of $H$
either the group containing $i$ cannot have overlap in row $j$, or the group containing $j$ cannot have overlap in
row $i$. Equivalently, a family of subsets of columns is not a direct cover if there is an unknown $h_{ij}$ such that
there is overlap in row $j$ of the group containing $i$, and there is overlap in row $i$ of the group containing $j$.
In graph terms, the coloring is improper if there is an edge $e = \{ i, j \}$ in $\mathcal{G}$ where $i$ is colored $c_i$, $j$ is colored
$c_j$, $i$ is adjacent to a vertex $p \ne j$ also colored $c_j$ (corresponding to overlap in row $i$ of group $c_j$ containing
$j$), and $j$ is adjacent to a vertex $q \ne i$ also colored $c_i$.

This "excluded colored subgraph" condition is clearly equivalent to (DCP), and so the next definition is
equivalent to a PSimDC.

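Direct $k$-Coloring: A direct $k$-coloring of a graph $\mathcal{G}$ is a function $f : V \to \{ 1, 2, \ldots, k \}$ such that $f$ is
a coloring in the usual sense and there is no subgraph of $\mathcal{G}$ colored like:

[figure: path $p - i - j - q$ with $p$ and $j$ colored $c_j$, and $i$ and $q$ colored $c_i$]

where $i \ne q$ and $j \ne p$.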
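The excluded colored subgraph condition can be tested edge by edge. The following sketch assumes an adjacency-set representation and a dict of colors; `is_direct_coloring` is an illustrative name. The six-vertex graph used in the example is the graph of the Powell and Toint pattern discussed in Section 2.3, colored by its three-group direct cover.

```python
def is_direct_coloring(adj, f):
    """True if f is proper and no 4-vertex path p - i - j - q is 2-colored as c_j c_i c_j c_i."""
    for i in adj:
        for j in adj[i]:
            if f[i] == f[j]:
                return False                      # not a usual coloring
            # p: another neighbor of i sharing j's color (overlap in row i).
            p_exists = any(p != j and f[p] == f[j] for p in adj[i])
            # q: another neighbor of j sharing i's color (overlap in row j).
            q_exists = any(q != i and f[q] == f[i] for q in adj[j])
            if p_exists and q_exists:
                return False
    return True

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(is_direct_coloring(path, {0: 1, 1: 2, 2: 1, 3: 2}))   # False
print(is_direct_coloring(path, {0: 1, 1: 2, 2: 3, 3: 1}))   # True

# Graph of the Powell-Toint pattern (1-based vertices), colored by the
# groups {1,5}, {2,6}, {3,4}.
pt = {1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2, 6}, 4: {1}, 5: {2}, 6: {3}}
print(is_direct_coloring(pt, {1: 1, 5: 1, 2: 2, 6: 2, 3: 3, 4: 3}))   # True
```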


Proof of Theorem 2.4.3: We shall reduce 3GCP to the problem of deciding whether $\mathcal{G}$ has a proper direct
3-coloring. Since direct $k$-coloring is equivalent to finding an optimal PSimDC, the reduction will imply that
finding an optimal PSimDC is NP-Complete.

Given a graph $K$ for 3GCP, a graph $\mathcal{G}$ will be constructed such that $K$ has a 3-coloring if and only if
$\mathcal{G}$ has a direct 3-coloring. First note that the 4-cycle has essentially only one proper direct 3-coloring, up to
permutation of colors:


    1 ----- 2
    |       |
    |       |
    3 ----- 1

(2.4.9)
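The statement about the 4-cycle is small enough to confirm by exhaustive enumeration. The sketch below repeats a compact version of the excluded-subgraph test so that it is self-contained; the names are illustrative, and "essentially only one" is read here as: all three colors appear, and exactly one pair of opposite vertices shares a color.

```python
from itertools import product

C4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}   # the 4-cycle 0-1-2-3-0

def is_direct(adj, f):
    """Proper coloring with no path p - i - j - q colored c_j, c_i, c_j, c_i."""
    for i in adj:
        for j in adj[i]:
            if f[i] == f[j]:
                return False
            if (any(p != j and f[p] == f[j] for p in adj[i])
                    and any(q != i and f[q] == f[i] for q in adj[j])):
                return False
    return True

direct = [c for c in product((1, 2, 3), repeat=4)
          if is_direct(C4, dict(enumerate(c)))]

print(len(direct))                                   # 12
print(all(set(c) == {1, 2, 3} for c in direct))      # True
```

The 12 colorings differ only by permuting the three colors and by choosing which opposite pair of vertices carries the repeated color; the six proper colorings that use just two colors are all rejected.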


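By (2.4.9), again up to a permutation of colors, there is essentially only one way to color the graph:

[figure: graph $L$ with terminals $s$ and $t$, and its coloring]   (2.4.10)

Thus graph $L$ forces its terminals $s$ and $t$ to be different colors.

To construct $\mathcal{G}$ from $K$, replace every edge of $K$ with $L$. Thus if

[figure: an example $K$ and the resulting $\mathcal{G}$]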

Suppose that $\mathcal{G}$ has a proper direct 3-coloring $f$. By (2.4.10), if each vertex of $K$ is colored with the color
received by its identified terminal under $f$, the resulting coloring must be a proper 3-coloring of $K$.

Conversely, suppose that there is a proper 3-coloring $f$ of $K$. Then color each terminal of $\mathcal{G}$ with the
color of its identified vertex in $K$, and color each non-terminal with the complementary color of the colors of
the two terminals to which it is adjacent. This coloring is clearly a proper 3-coloring of $\mathcal{G}$. Any path of four
vertices in $\mathcal{G}$ must contain two terminals separated by a non-terminal. But these three vertices use all three
colors, and hence the excluded colored subgraph cannot exist in $\mathcal{G}$. Thus $\mathcal{G}$ has a proper direct 3-coloring.
□

A direct coloring is called a "symmetric coloring" in Coleman and Moré (1982), and they call a PSimDC
a "symmetrically consistent partition." They use these concepts to give a quite different proof of Theorem
2.4.3 (see their Theorem 3.3 and the remarks following it).


2.4.4.  The  Complexity  of  Finding  Optimal  GSimDCs

The fact that a column can belong to more than one group, and hence a vertex of $\mathcal{G}$ can receive more
than one color, makes the proof of Theorem 2.4.4 more difficult than the proof of Theorem 2.4.3. The same
analysis of the implications of (DCP) for colorings as in the PSimDC case still holds here except that the
concept "vertex $i$ is colored $c_i$" is replaced by "color $c_i$ is one of vertex $i$'s colors." With this insight it can be
shown that the following formal definition is equivalent to a GSimDC with $k$ groups. Let $S = \{ 1, 2, \ldots, k \}$.

Direct $k$-Multicoloring: A direct $k$-multicoloring of a graph $\mathcal{G}$ is a function $f : V \to 2^S \setminus \{ \emptyset \}$ satisfying (1)
for each edge $e = \{ i, j \}$ of $\mathcal{G}$, $f(i) \cap f(j) = \emptyset$ (this is analogous to $f$ being a coloring in the usual
sense) and (2) there is no subgraph of $\mathcal{G}$ like

[figure: path $p - i - j - q$]

with $i \ne q$, $j \ne p$, $f(i) \cap f(q) \ne \emptyset$ and $f(j) \cap f(p) \ne \emptyset$ (this is analogous to the excluded colored
subgraph condition for PSimDCs).

Proof of Theorem 2.4.4: By reduction from 3GCP. Because of the equivalence of direct $k$-multicoloring with
finding optimal GSimDCs discussed above, it suffices to reduce 3GCP to the problem of deciding whether a
graph has a proper direct 3-multicoloring.

Given a graph $K$ for 3GCP, a graph $\mathcal{G}$ will be constructed such that $K$ has a proper 3-coloring if and only
if $\mathcal{G}$ has a proper direct 3-multicoloring. First, note that by exhaustive enumeration, up to a permutation
of colors the only way to 3-multicolor the displayed graph properly and directly is as shown:


1 

2 


(24.11) 


In  particular,  all  vertices  of  subsequent  graphs  constructed  with  (2.1.11)  can  receive  only  one  color. 

Given  the  essentially  unique  coloring  of  (2.4. 11),  consider  how  to  extend  the  indicated  partial  coloring 
on  the  following  graph: 


Now  using  “*  =  t*  as  shorthand  for  “vertex  z  is  colored  «”  it  can  be  seen  that 


which  implies  that  path  odef  is  colored  2323,  an  improper  3-inulticoloring,  and  so  a  must  be  colored  3  and 
d  must  be  colored  2.  Now 

b  —  3  =4  c  =  1  =*e  =  3=>/  =  2, 

which  implies  that  path  ode/  is  colored  3232,  so  that  b  must  be  colored  I.  Thus  the  only  way  (up  to  a 
permutation)  to  3-multicolor  L  properly  is: 


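[Figure (2.4.12): the essentially unique proper direct 3-multicoloring of $L$; only the color labels 1, 3, 2 survive in the source.]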
Note that as in (2.4.10), $L$ forces its terminals a and t to be different colors.
To derive $\mathcal{G}$ from $K$, replace every edge of $K$ with $L$. Thus if


Suppose that $f$ is a 3-multicoloring of $\mathcal{G}$. By (2.4.11), $f$ must be in fact a 3-coloring of $\mathcal{G}$, that is, each vertex of $\mathcal{G}$ has exactly one color. Now color each vertex of $K$ with the color of its identified terminal in $\mathcal{G}$. Since the terminals of $L$ must have different colors, $K$ must be properly 3-colored.

Now suppose that $f$ is a 3-coloring of $K$. Color the terminals of $\mathcal{G}$ by the colors of their identified vertices in $K$, and color the non-terminals of $\mathcal{G}$ as indicated in (2.4.12), permuting colors appropriately. The result is clearly an ordinary 3-coloring of $\mathcal{G}$. Since each $L$ in $\mathcal{G}$ is properly directly 3-multicolored, the only possible way for this coloring of $\mathcal{G}$ to be an improper direct 3-multicoloring is for a counterexample path to have a terminal as one of its two interior vertices. But the neighborhood of a terminal vertex $v$ in $\mathcal{G}$ looks like:

[Figure: neighborhood of a terminal vertex $v$ in $\mathcal{G}$; the drawing is not recoverable from the source.]

It is now easily seen that no counterexample path exists. Thus $\mathcal{G}$ can be properly directly 3-multicolored.

□

2.4.5.  Other  Complexity  Results 

Two remarks are in order about the last two proofs. First, since 3GCP is NP-Complete even for planar $K$, and since the edge replacement graphs $L$ are themselves planar in both cases, finding optimum PSimDCs or GSimDCs for Hessians whose sparsity patterns correspond to planar graphs is NP-Complete. Though this fact has little practical significance in itself, it does have an interesting corollary. It is well-known that a planar graph on $n$ vertices can have at most $3n - 6 = O(n)$ edges (see Bondy and Murty (1978), Corollary 9.5.2). Thus the intractability of finding optimal PSimDCs and GSimDCs is not due to requiring algorithms to process nearly dense Hessians. In particular, it is also NP-Complete to find optimal PSimDCs and GSimDCs for Hessians with $O(n)$ unknowns (with density $O(1/n)$).

Second, the proof of Theorem 2.4.4 shows that for every graph $\mathcal{G}$ resulting from the reduction, any GSimDC for the corresponding Hessian which has only three groups must in fact be a PSimDC. Thus the proof of Theorem 2.4.4 also proves Theorem 2.4.3 as a corollary. However, since the proof of Theorem 2.4.3 given is quite simple and is a useful warm-up for the proof of Theorem 2.4.4, it was included despite its technical redundancy.

The complexity of substitution methods can also be analysed through graph coloring. Coleman and Moré (1982) consider a particular subclass of substitution methods called lower-triangular substitution methods defined originally in Powell and Toint (1979). These methods bear roughly the analogous relation to general substitution methods as SeqDCs do to general direct covers. Coleman and Moré show that finding an optimal set of difference directions for a lower-triangular substitution method is equivalent to a certain kind of graph coloring that they call triangular coloring, and use this equivalence to prove the following (see their Theorem 7.2).

Theorem 2.4.9: Finding an optimal lower-triangular substitution set of difference directions is NP-Complete. □
2.4.6.  Heuristic Approaches to Direct Methods


The NP-Completeness theorems in this section are rather discouraging, since the conventional wisdom is that NP-Completeness is tantamount to intractability. On a more positive note, much work has been done on finding near-optimal, polynomial-time, heuristic algorithms for NP-Complete problems (see Garey and Johnson (1979), Chapter 6).

In the present case, the most obvious heuristic approach is to reduce D2GCP to GCP and then apply known heuristic results on GCP to the reduced graph. Given a graph $\mathcal{G} = (V, E)$, define $D_2(\mathcal{G})$ (the distance-2 completion of $\mathcal{G}$) to be the graph on the same vertex set $V$, and with edges $E_2 = \{ \{i,j\} : i \text{ and } j \text{ are distance 2 or less apart in } \mathcal{G} \}$. Equivalently, when the vertex-vertex adjacency matrix $A$ of $\mathcal{G}$ has a non-zero diagonal, then $D_2(\mathcal{G})$ is the graph whose adjacency matrix is $A^2$. A third equivalent formulation is that $D_2(\mathcal{G})$ is the intersection graph (see Golumbic (1980), Section 1.2) of the columns of its adjacency matrix. It is easy to verify that a coloring of $V$ is a proper distance-2 coloring of $\mathcal{G}$ if and only if it is a proper (distance-1) coloring of $D_2(\mathcal{G})$ (note that this reduction also implies that D2GCP is NP-Complete).
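
The reduction from distance-2 coloring to ordinary coloring can be illustrated with a small sketch. The function names below (`distance2_completion`, `is_proper_coloring`, `is_proper_distance2_coloring`) are hypothetical and not from the source, and the sketch assumes a simple undirected graph with vertices numbered from 0 and no self-loops. It builds the edge set of the distance-2 completion by joining every vertex to its neighbors and every pair of neighbors of a common vertex to each other.

```python
from itertools import combinations


def adjacency(n, edges):
    """Adjacency sets of a simple undirected graph on vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def distance2_completion(n, edges):
    """Edge set of the distance-2 completion: all pairs at distance 1 or 2."""
    adj = adjacency(n, edges)
    completed = set()
    for v in range(n):
        # Every neighbor of v is at distance 1 from v.
        for u in adj[v]:
            completed.add(frozenset((u, v)))
        # Any two neighbors of v are at distance at most 2 from each other.
        for a, b in combinations(sorted(adj[v]), 2):
            completed.add(frozenset((a, b)))
    return completed


def is_proper_coloring(edge_set, color):
    """Ordinary (distance-1) properness with respect to a set of 2-element edges."""
    return all(color[a] != color[b] for a, b in (tuple(e) for e in edge_set))


def is_proper_distance2_coloring(n, edges, color):
    """Direct check: each vertex and its neighbors have pairwise distinct colors."""
    adj = adjacency(n, edges)
    for v in range(n):
        closed = [v] + sorted(adj[v])
        if len({color[u] for u in closed}) != len(closed):
            return False
    return True
```

On a path with four vertices, for example, the completion adds the two "skip" edges to the three path edges, and a coloring is proper on the completed graph exactly when it is a proper distance-2 coloring of the path.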

If there were a “good” heuristic for GCP, it could be composed with $D_2(\cdot)$ to obtain a “good” heuristic for D2GCP. Coleman and Moré (1981), Section 4, gives a good overview of the present state of the art in GCP heuristics, which is not “good”. In fact, if $c_H(\mathcal{G})$ denotes the number of colors used by the best known heuristic on graph $\mathcal{G}$ and $\chi(\mathcal{G})$ denotes the optimal number of colors necessary for $\mathcal{G}$ (its chromatic number), then in the worst case

$$\max_{\mathcal{G} \text{ on } n \text{ vertices}} \frac{c_H(\mathcal{G})}{\chi(\mathcal{G})} = O(n^a), \tag{2.4.13}$$

where $a = 1 - \ldots$ [the remainder of the exponent is illegible in the source] (the best known heuristic and the bound (2.4.13) are due to Wigderson (1982); see also Johnson (1974) for worst case analysis of other graph coloring heuristics). Two facts mitigate the severity of (2.4.13). First, the range of $D_2(\cdot)$ does not include all graphs, and hence a better bound than (2.4.13) can be obtained for D2GCP. Second, average-case results have been obtained for GCP heuristics that are considerably better than (2.4.13).

To improve on (2.4.13) for D2GCP, consider the specific heuristic called the distance-2 sequential algorithm (D2SA). Define $\mathcal{N}(i) = \{ j \ne i \mid j \text{ is distance} \le 2 \text{ from } i \}$, the distance-2 neighborhood of a vertex $i$ in a graph. Thus, if $i$ has color $c$ in a proper distance-2 coloring, no $j \in \mathcal{N}(i)$ can be color $c$. Then D2SA assigns color

$$\min\{\, c \ge 1 \mid \text{no } j \in \mathcal{N}(i),\ j < i, \text{ is colored } c \,\}$$

to vertex $i$, $i = 1, \ldots, |V|$. That is, D2SA assigns vertex $i$ the smallest color not conflicting with those already assigned. (D2SA is just the distance-2 version of the best known GCP heuristic, the sequential algorithm, which is called the CPR method in its applications to approximating sparse Jacobians; see Curtis, Powell and Reid (1974).) Let $c_S(\mathcal{G})$ denote the number of colors used by D2SA when applied to $\mathcal{G}$.
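
The greedy rule just described can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in any of the cited papers: the function name `d2sa`, the 0-based vertex numbering, and the optional `order` argument (which lets the vertices be processed in an order other than the natural one) are all hypothetical.

```python
def d2sa(n, edges, order=None):
    """Distance-2 sequential coloring of a simple graph on vertices 0..n-1.

    Returns a list of colors numbered from 1. Vertices are processed in
    `order` (default 0, 1, ..., n-1); each receives the smallest color not
    already used within distance 2 of it.
    """
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    order = list(range(n)) if order is None else list(order)
    color = [0] * n  # 0 means "not yet colored"
    for v in order:
        # Distance-2 neighborhood: neighbors and neighbors of neighbors.
        neighborhood = set(adj[v])
        for u in adj[v]:
            neighborhood |= adj[u]
        neighborhood.discard(v)
        used = {color[u] for u in neighborhood if color[u] != 0}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color


# A path on four vertices: the two end vertices are at distance 3,
# so they may share a color.
print(d2sa(4, [(0, 1), (1, 2), (2, 3)]))  # [1, 2, 3, 1]
```

The number of colors used is `max(d2sa(...))`, and, as with any sequential heuristic, it can depend strongly on the processing order.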

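In order to obtain bounds on $c_S(\mathcal{G})$, two definitions are required. The maximum degree of $\mathcal{G}$, $\Delta(\mathcal{G})$, is defined as

$$\Delta(\mathcal{G}) = \max_i |\{ j \mid \{i,j\} \in E(\mathcal{G}) \}|.$$

The distance-2 chromatic number of $\mathcal{G}$, $\chi_2(\mathcal{G})$, is defined as the optimal number of colors in a proper distance-2 coloring of $\mathcal{G}$,

$$\chi_2(\mathcal{G}) = \min\{ k \mid \mathcal{G} \text{ has a proper distance-2 coloring with } k \text{ colors} \}.$$

The following theorem bounds $\chi_2(\mathcal{G})$ and $c_S(\mathcal{G})$ in terms of $\Delta(\mathcal{G})$, and a corollary improves (2.4.13) for D2SA: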

Theorem 2.4.10: Let $d = \Delta(\mathcal{G})$. Then

$$d + 1 \le \chi_2(\mathcal{G}) \le c_S(\mathcal{G}) \le d^2 + 1 \tag{2.4.14}$$

for all graphs $\mathcal{G}$.

Proof: Let $i$ be a vertex incident to exactly $d$ edges, and note that $i$ and its $d$ nearest neighbors must all be different colors in a proper distance-2 coloring; this proves the lower bound in (2.4.14). The second inequality in (2.4.14) is trivial.

To prove the upper bound in (2.4.14), note that for any vertex $i$, $|\mathcal{N}(i)| \le d + d(d - 1) = d^2$. Suppose that D2SA assigns color $l$ to vertex $i$; by definition of D2SA, color $l$ is assigned only if at least one vertex of each color $1, \ldots, l - 1$ is in $\mathcal{N}(i)$. Thus, if $i$ were assigned color $l > d^2 + 1$, then $|\mathcal{N}(i)| \ge d^2 + 1$ (a contradiction). (This proof is essentially a constructive proof of Corollary 8.2.1 in Bondy and Murty (1978).)

□

Corollary 2.4.11: For all $n \ge 1$,

$$\max_{\mathcal{G} \text{ on } n \text{ vertices}} \frac{c_S(\mathcal{G})}{\chi_2(\mathcal{G})} \le \sqrt{n - 1} + 1 = O(n^{1/2}). \tag{2.4.15}$$


Proof: Clearly, $c_S(\mathcal{G}) \le n$. Let $k = \chi_2(\mathcal{G})$. Applying the first and third inequalities of (2.4.14) yields

$$c_S(\mathcal{G}) \le (k - 1)^2 + 1. \tag{2.4.16}$$

Consider two cases:

Case 1: If $n \le (k - 1)^2 + 1$, then $\sqrt{n - 1} + 1 \le k$ and so

$$\frac{c_S(\mathcal{G})}{k} \le \frac{n}{k} \le \frac{n}{\sqrt{n - 1} + 1} \le \sqrt{n - 1} + 1.$$

Case 2: If $n > (k - 1)^2 + 1$, then $k < \sqrt{n - 1} + 1$, and so

$$\frac{c_S(\mathcal{G})}{k} \le \frac{(k - 1)^2 + 1}{k} = k - 2 + \frac{2}{k} \le k < \sqrt{n - 1} + 1.$$

(Corollary 2.4.11 is essentially (4.6) of Coleman and Moré (1981) in the special case that the matrix is square and symmetric). □

Graphs that attain bound (2.4.16) for a certain ordering of their vertices exist for $k = 1, 2, 3, 4$. The cases $k = 1, 2$ are trivial. For $k = 3$, consider $\mathcal{G}_3 = (V_3, E_3)$ defined by

$$V_3 = \{ x_{ij} \}, \quad i = 1, 2, 3, \quad j = 1, 2, 3, 4, 5,$$

$$E_3 = \{ \{ x_{ij}, x_{i+1,j+1} \} \} \quad \text{all } i, j \text{ (subscripts modulo 3 and 5)}.$$

Then D2SA assigns $x_{ij}$ color $i$ when the vertices are ordered by $i$ (which is optimal by (2.4.14)), and assigns $x_{ij}$ color $j$ when the vertices are ordered by $j$ (which is the worst possible, by (2.4.16)). For $k = 4$, consider $\mathcal{G}_4 = (V_4, E_4)$ defined by

$$V_4 = \{ x_{ij} \}, \quad i = 1, 2, 3, 4, \quad j = 1, \ldots, 10,$$

({*0')*i+lj+5  }  j  1 

{ x1+iiJ+j  }  all  odd  j  >  all  t  (subscripts  modulo  4  and  10). 

{*«>, even  j) 

Then D2SA applied to $\mathcal{G}_4$ also colors $x_{ij}$ with $i$ when ordered by $i$, and with $j$ when ordered by $j$ (which are again respectively optimal and worst possible).

Extending this construction seems to be extremely difficult. Its extension appears to be roughly equivalent to solving a hard open problem in extremal graph theory (see Bollobás (1978), Section IV.1). Even if it could be extended, the number of vertices is given by $n = k((k - 1)^2 + 1)$, so that

$$\frac{c_S(\mathcal{G})}{\chi_2(\mathcal{G})} = O(n^{1/3}), \tag{2.4.17}$$

which is a better result than (2.4.15). Thus, while (2.4.14), (2.4.15) and (2.4.16) are better results than (2.4.13), we conjecture that (2.4.17) is also a (better) bound.

Turning from the worst case to the average case, Grimmett and McDiarmid (1975) proved the following theorem:

Theorem 2.4.12: Fix $n$ vertices, and let vertices $i$ and $j$ be independently connected by an edge with fixed probability $p$, $0 < p < 1$. Let $c_C(\mathcal{G})$ be the number of colors used by CPR on $\mathcal{G}$, and $\chi(\mathcal{G})$ be the optimal number of colors (so that $c_C(\mathcal{G})$ and $\chi(\mathcal{G})$ are random variables). Then

$$\frac{c_C(\mathcal{G})}{\chi(\mathcal{G})} \le 2 + \epsilon$$

for all $\epsilon > 0$ with probability $1 - o(1)$. □

Thus, on average, CPR almost never performs more than twice as badly as the optimal strategy. This theorem has at least two unsatisfactory features in this context. First, sparsity patterns in practical problems are not uniformly random as assumed in Theorem 2.4.12. Second, even if they were, the density of sparsity patterns tends to be $O(1/n)$ rather than constant with increasing $n$. It would be useful to determine a better random model for sparsity patterns, or at least to prove Theorem 2.4.12 under the assumption that $p = O(1/n)$.

The weakness of such random models in predicting practical performance is illustrated by the computational experiments of Coleman and Moré (1982) with a version of a CPR heuristic, namely CPR applied to the sparsity pattern with the columns in the smallest-last ordering (see Coleman and Moré (1981) for details). In their Table 4.1, column “maxr” represents a lower bound on $\chi_2(\mathcal{G})$, and column “sl” represents $c_S(\mathcal{G})$, using the smallest-last ordering. When averaged over the 30 real examples that they tested, $c_S(\mathcal{G})$ used 14.43 colors whereas the lower bound averaged 13.6 colors. Thus the improved CPR used less than one extra color above the optimal on average, and $c_S(\mathcal{G})$ averaged at most 6.1% larger than $\chi_2(\mathcal{G})$, a big improvement over Theorem 2.4.12.

2.5.  Lower  Bounding  Elimination  Methods 

The computational experiments reported in Coleman and Moré (1982), Tables 4.1 and 8.1 are quite intriguing. For the 30 practical problems on which they tested various heuristics, the best NDC heuristic used 14.43 groups on average, the best SeqDC heuristic used 11.63 groups and the best lower-triangular substitution heuristic used 7.87 groups. This progression leads to speculation about the minimum possible number of difference directions for a given sparsity pattern, and whether there is a polynomial algorithm to compute it. The corresponding problem for sparse Jacobians is relatively easy; see Newsam and Ramsdell
(1982),  Theorem  3- 

The ultimate lower bound on the number of difference directions necessary to approximate a given $H$, call it $\gamma(H)$, is the minimum number needed by an elimination method, since these methods allow complete freedom in choosing the $d^l$. This section presents some results that give various lower bounds for $\gamma(H)$ and presents some evidence that the best lower bound is polynomially computable. It is conjectured that the best lower bound is tight for every $H$.

It must be emphasised at the outset that elimination methods are studied here not because they are claimed to be in any sense practical. Instead the aim is to formulate a procedure whereby it can be easily checked how far the substitution heuristics are from $\gamma(H)$. If these heuristics are found to be close to $\gamma(H)$ on average, the practicality of the heuristics evidently makes further work on elimination methods in practice unappealing. Alternatively, if there is to be a significant gap between the substitution heuristics and $\gamma(H)$, it would be justified to investigate whether there are practically implementable elimination heuristics which out-perform the substitution heuristics.

In this section we shall not assume that the sparsity pattern of $H$ has a non-zero diagonal. Also, $H$ and its associated graph will be referred to interchangeably, so that it will make sense to write that $H$ is bipartite. For ease of reference, equations (2.2.1) are reproduced here, deleting the hat on $H$ and the dependence on $x^0$ for simplicity:

$$H d^l = \Delta^l, \quad l = 1, 2, \ldots, k. \tag{2.5.1}$$

(Recall that $\Delta^l$ is defined as $(g(x^0 + d^l) - g(x^0))^T$.) Note that (2.5.1) is a set of $nk$ linear equations in the $\binom{n+1}{2}$ unknowns $h_{11}, h_{21}, h_{22}, h_{31}, \ldots, h_{n1}, h_{n2}, \ldots, h_{nn}$. Denote the coefficient matrix of (2.5.1) by $A^{n,k}$.

2.5.1.  A General Lower Bound on Evaluations

In the situation of interest, sparsity causes many of the unknowns to be deleted, that is, causes many of the columns in $A^{n,k}$ to be deleted. However, assume for the moment that $H$ is completely dense. Then by simple linear algebra, the largest number of unknowns that can be solved for by a subsystem of equations (2.5.1) is equal to the rank of $A^{n,k}$. Suppose that it can be shown that the maximum rank possible for $A^{n,k}$ is, say, $r_{n,k}$. Note that $r_{n,k}$ would have to be increasing in $k$. As sparsity comes into play, columns are eliminated from $A^{n,k}$, and its rank can only decrease. Thus, even for a sparse $H$, the maximum number of unknowns that can be solved for is still at most $r_{n,k}$. By this reasoning a lower bound on $\gamma(H)$ can be calculated as follows. Denote the number of unknowns in $H$ by $q$, and the smallest $k$ such that $r_{n,k} \ge q$ by $k'$. Then at least $k'$ evaluations are necessary to approximate $H$. Thus we shall now focus our attention on determining $r_{n,k}$.

The rank of $A^{n,k}$ is affected by the numerical values in the $d^l$. By assuming that the $d^l$ satisfy the Haar condition, namely that every square submatrix of the matrix whose $l$th column is $d^l$ is nonsingular, the rank of $A^{n,k}$ is maximised. (The Haar condition is implied by the assumption that the entries of the $d^l$ are independent algebraic indeterminates. It is also implied by the assumption that the $d^l$ are perturbed from their given values by infinitesimals, similar to the construction often used in non-degeneracy proofs.) For example, choosing the $d^l$ as columns from a Vandermonde matrix (see Knuth (1973), Section 1.2.3, exercises 36-45) satisfies the Haar condition.

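To determine $r_{n,k}$, we investigate the structure of $A^{n,k}$. Label the $i$th row of the set of equations associated with $\Delta^l$ by “$\Delta_i^l$,” and label the column corresponding to variable $h_{ij}$ by “$h_{ij}$.” For $n = 4$ and $k = 2$, $A^{4,2}$ has the form

$$
\begin{array}{c|c|cc|ccc|cccc}
 & h_{11} & h_{21} & h_{22} & h_{31} & h_{32} & h_{33} & h_{41} & h_{42} & h_{43} & h_{44} \\ \hline
\Delta_1^1 & d_1^1 & d_2^1 & & d_3^1 & & & d_4^1 & & & \\
\Delta_1^2 & d_1^2 & d_2^2 & & d_3^2 & & & d_4^2 & & & \\ \hline
\Delta_2^1 & & d_1^1 & d_2^1 & & d_3^1 & & & d_4^1 & & \\
\Delta_2^2 & & d_1^2 & d_2^2 & & d_3^2 & & & d_4^2 & & \\ \hline
\Delta_3^1 & & & & d_1^1 & d_2^1 & d_3^1 & & & d_4^1 & \\
\Delta_3^2 & & & & d_1^2 & d_2^2 & d_3^2 & & & d_4^2 & \\ \hline
\Delta_4^1 & & & & & & & d_1^1 & d_2^1 & d_3^1 & d_4^1 \\
\Delta_4^2 & & & & & & & d_1^2 & d_2^2 & d_3^2 & d_4^2
\end{array}
\tag{2.5.2}
$$

In general the entry in row $\Delta_q^l$, column $h_{ij}$ of $A^{n,k}$ is

$$
\begin{cases}
d_j^l & \text{if } q = i, \\
d_i^l & \text{if } q = j, \\
0 & \text{otherwise.}
\end{cases}
\tag{2.5.3}
$$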

Besides yielding a lower bound on $\gamma(H)$, the determination of $r_{n,k}$ is also interesting from another point of view. Each additional gradient evaluation yields $n$ more seemingly independent linear equations. Because of symmetry there are only $\binom{n+1}{2} = \frac{1}{2} n(n+1)$ unknowns in a dense Hessian. Thus it might appear that only $\lceil (n+1)/2 \rceil$ evaluations would suffice to approximate a dense Hessian, since the number of equations $= n \cdot \lceil (n+1)/2 \rceil \ge \binom{n+1}{2} = q$. Perhaps then the number of gradient evaluations needed even for a dense Hessian could be reduced below $n$ by a clever choice of difference directions. However, the next theorem shows that such savings are not possible. The theorem appears to be well-known in the folklore, but we know of no published proof.


Theorem 2.5.1: The maximum number of unknowns that can be determined by a set of gradient evaluations along any $k$ directions, $0 \le k \le n$, is given by

$$r_{n,k} = n + (n - 1) + \cdots + (n - k + 1) = nk - \binom{k}{2}. \tag{2.5.4}$$

In particular, $n$ evaluations are necessary to obtain all $\binom{n+1}{2}$ unknowns in the completely dense case. Bound (2.5.4) is sharp for some sparsity patterns.
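
The rank formula of Theorem 2.5.1 can be checked numerically for small cases. The sketch below is only an illustration: the function names are hypothetical, indices are 0-based, and the difference directions are taken as columns of a Vandermonde matrix with nodes 2, 3, 4, ... (an assumed concrete choice of directions with all square submatrices nonsingular). The coefficient matrix is assembled entry by entry, with the coefficient of $h_{ij}$ in row $q$ equal to $d_j$ if $q = i$ and $d_i$ if $q = j$, and its rank is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction
from math import comb


def coefficient_matrix(n, directions):
    """Coefficient matrix of H d = Delta for a dense symmetric n x n H.

    Rows are indexed by (component q, direction d); columns by the
    unknowns h_ij with i >= j.
    """
    cols = [(i, j) for i in range(n) for j in range(i + 1)]
    rows = []
    for q in range(n):
        for d in directions:
            row = []
            for i, j in cols:
                if q == i:
                    row.append(Fraction(d[j]))
                elif q == j:
                    row.append(Fraction(d[i]))
                else:
                    row.append(Fraction(0))
            rows.append(row)
    return rows


def rank(matrix):
    """Rank by exact Gaussian elimination on a list of Fraction rows."""
    m = [row[:] for row in matrix]
    ncols = len(m[0]) if m else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def vandermonde_directions(n, k):
    """k directions of length n: (1, t, t^2, ...) for t = 2, 3, ..., k + 1."""
    return [[(l + 2) ** p for p in range(n)] for l in range(k)]


def expected_rank(n, k):
    """The value n*k - C(k, 2) stated in Theorem 2.5.1."""
    return n * k - comb(k, 2)
```

For instance, `rank(coefficient_matrix(4, vandermonde_directions(4, 2)))` can be compared against `expected_rank(4, 2)`; exact rational arithmetic avoids any floating-point tolerance in the rank decision.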

We shall give two different proofs of Theorem 2.5.1, the first a column-oriented proof, the second a row-oriented proof. The column-oriented proof is more direct since it exhibits an explicit subset of $r_{n,k}$ of the unknowns which form a basis, but the row-oriented proof is simpler.

Column-oriented Proof of Theorem 2.5.1: Partition the rows of $A^{n,k}$ as in (2.5.2) into $n$ row blocks of $k$ equations each, the $i$th row block consisting of equations $\Delta_i^1, \Delta_i^2, \ldots, \Delta_i^k$. Partition the columns of $A^{n,k}$ as in (2.5.2) into $n$ column blocks of $1, 2, \ldots, n$ unknowns each, the $j$th column block consisting of columns $h_{j1}, h_{j2}, \ldots, h_{jj}$. Let $A_{ij}^{n,k}$ denote the $i,j$th submatrix of this partition. To simplify notation, define $c^j$ as the $k$-vector of the $j$th components of the $\{ d^l \}$, i.e., $c^j = (d_j^1, d_j^2, \ldots, d_j^k)^T$, $j = 1, \ldots, n$. Then (2.5.2) and (2.5.3) imply that each $A_{ij}^{n,k}$ is completely described by

$$
A_{ij}^{n,k} =
\begin{cases}
0, & \text{if } i > j; \\
(c^1, c^2, \ldots, c^i), & \text{if } i = j; \\
(0, 0, \ldots, c^j, \ldots, 0) \text{ with } c^j \text{ in position } i \text{ of the positions } 1, 2, \ldots, j, & \text{if } i < j.
\end{cases}
\tag{2.5.5}
$$

To complete the proof, it must be shown that rank$(A^{n,k}) = r_{n,k}$. Let $E$ be the set of columns $h_{ij}$ with $k < j \le i \le n$, and let $F$ be the complementary set of columns. We shall show that the columns of $F$ can be used to eliminate the columns of $E$. Note that $|F| = r_{n,k}$, and that each column in $F$ involves only $c^i$ with $i \le k$.

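Define $\lambda^l$ to be the solution of the system

$$(c^1 \; c^2 \; \cdots \; c^k) \lambda^l = -c^l, \quad l = k + 1, k + 2, \ldots, n$$

($\lambda^l$ exists and is unique under the assumption of the Haar condition; $\lambda^l_j$ denotes its $j$th component). The following computations show that linear combinations of the columns in $F$, using the $\{ \lambda^l \}$ as multipliers, can be used to eliminate the columns in $E$; since the form of the linear combinations is complicated, the result is best understood by referring to the following example (2.5.6), which is (2.5.2) re-written in the new notation with the multipliers defined below for the case that $k = 2$, $q = 4$ and $p = 3$:

$$
\begin{array}{r|c|cc|ccc|cccc}
 & h_{11} & h_{21} & h_{22} & h_{31} & h_{32} & h_{33} & h_{41} & h_{42} & h_{43} & h_{44} \\ \hline
\text{multipliers for } h_{44} & \lambda^4_1 \lambda^4_1 & \lambda^4_2 \lambda^4_1 & \lambda^4_2 \lambda^4_2 & & & & \lambda^4_1 & \lambda^4_2 & & 1 \\
\text{multipliers for } h_{43} & 2 \lambda^3_1 \lambda^4_1 & \lambda^3_2 \lambda^4_1 + \lambda^4_2 \lambda^3_1 & 2 \lambda^3_2 \lambda^4_2 & \lambda^4_1 & \lambda^4_2 & & \lambda^3_1 & \lambda^3_2 & 1 & \\ \hline
\text{row block } 1 & c^1 & c^2 & & c^3 & & & c^4 & & & \\
\text{row block } 2 & & c^1 & c^2 & & c^3 & & & c^4 & & \\
\text{row block } 3 & & & & c^1 & c^2 & c^3 & & & c^4 & \\
\text{row block } 4 & & & & & & & c^1 & c^2 & c^3 & c^4
\end{array}
\tag{2.5.6}
$$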


Let $p$ and $q$ satisfy $k < p < q \le n$, so that $h_{qq}$ and $h_{qp}$ are typical columns in $E$. To eliminate column $h_{qq}$ from $A^{n,k}$, add to it $\lambda^q_j$ times column $h_{qj}$, $j = 1, \ldots, k$, and $\lambda^q_i \lambda^q_j$ times column $h_{ij}$, $i = 1, \ldots, k$, $j = 1, \ldots, i$ (these multipliers are the first line of $\lambda$'s in (2.5.6)). From (2.5.5) and (2.5.6) the resulting column is zero in row blocks $t$, $k < t \le n$, $t \ne q$, since no column with a non-zero coefficient is non-zero in these row blocks. In row block $i$, $1 \le i \le k$, the non-zero contributions to the resulting column are $\sum_{j=1}^{i} \lambda^q_i \lambda^q_j c^j$ from column block $i$, $\lambda^q_i \lambda^q_l c^l$ from column block $l$, $i < l \le k$, and $\lambda^q_i c^q$ from column block $q$, for a total of

$$\lambda^q_i \Bigl( \sum_{j=1}^{k} \lambda^q_j c^j + c^q \Bigr) = 0.$$

In row block $q$, the only non-zero contribution is from column block $q$, which is

$$\sum_{j=1}^{k} \lambda^q_j c^j + c^q = 0.$$

Thus the resultant column is zero and column $h_{qq}$ is indeed dependent on the columns of $F$.

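Now column $h_{qp}$ is eliminated using the columns in $F$. Add to it $\lambda^p_j$ times column $h_{qj}$, $j = 1, \ldots, k$, $\lambda^q_j$ times column $h_{pj}$, $j = 1, \ldots, k$, and $(\lambda^p_i \lambda^q_j + \lambda^q_i \lambda^p_j)$ times column $h_{ij}$, $i = 1, \ldots, k$, $j = 1, \ldots, i$ (these multipliers are shown in the second row of $\lambda$'s in (2.5.6)). There is no non-zero contribution to the resultant column in any row block $t$, $k < t \le n$, $t \ne p, q$. In row block $i$, $1 \le i \le k$, there is a contribution of $\sum_{j=1}^{i} (\lambda^p_i \lambda^q_j + \lambda^q_i \lambda^p_j) c^j$ from column block $i$, a contribution of $(\lambda^p_i \lambda^q_l + \lambda^q_i \lambda^p_l) c^l$ from column block $l$, $i < l \le k$, a contribution of $\lambda^q_i c^p$ from column block $p$, and a contribution of $\lambda^p_i c^q$ from column block $q$, for a total of

$$\lambda^p_i \Bigl( \sum_{j=1}^{k} \lambda^q_j c^j + c^q \Bigr) + \lambda^q_i \Bigl( \sum_{j=1}^{k} \lambda^p_j c^j + c^p \Bigr) = 0.$$

In row block $p$, there is a contribution of $\sum_{j=1}^{k} \lambda^q_j c^j$ from column block $p$ and a contribution of $c^q$ from column block $q$ for a total of

$$\sum_{j=1}^{k} \lambda^q_j c^j + c^q = 0.$$

In row block $q$, the only non-zero contribution is from column block $q$ and is

$$\sum_{j=1}^{k} \lambda^p_j c^j + c^p = 0.$$

Once again the row block totals are all zero, so that column $h_{qp}$ is also dependent on the columns in $F$.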

Eliminating the $1 + 2 + \cdots + (n - k)$ columns of $E$ shows that rank$(A^{n,k}) \le r_{n,k}$. To show that rank$(A^{n,k}) = r_{n,k}$, delete the columns of $E$ from $A^{n,k}$, and delete the last $k - i$ rows from each row block $i$, $i < k$. The remaining matrix is $r_{n,k}$ by $r_{n,k}$ and is block upper triangular with square, non-singular diagonal blocks. Thus this submatrix of $A^{n,k}$ is non-singular, and hence rank$(A^{n,k}) \ge r_{n,k}$.

To show that this bound is sharp, consider the sparsity pattern which has $h_{ij}$ as an unknown whenever $i \le k$ or $j \le k$. By letting $d^i$ be the $i$th unit vector for $i = 1, 2, \ldots, k$, all $r_{n,k}$ of the unknowns can clearly be solved for, and thus the bound of the theorem is attained. □


Row-oriented Proof of Theorem 2.5.1: This proof is due to Hoffman (1962).

For each 2-subset $\{ i, j \} \subset \{ 1, 2, \ldots, k \}$ define an $nk$-vector $z^{ij}$ with entries indexed with the same labels as the rows of $A^{n,k}$ by

$$
z^{ij}_p =
\begin{cases}
d_q^j & \text{if } p = \Delta_q^i, \\
-d_q^i & \text{if } p = \Delta_q^j, \\
0 & \text{otherwise.}
\end{cases}
\tag{2.5.7}
$$

We now show that $z^{ij}$ is in the null space of the columns of $A^{n,k}$, i.e., $z^{ij} A^{n,k} = 0$. Take columns $h_{11}$ and $h_{21}$ of $A^{n,k}$ as representative examples. By (2.5.6), column $h_{11}$ is non-zero only in rows $\Delta_1^l$, $l = 1, 2, \ldots, k$. Comparing with (2.5.7), $z^{ij}$ and column $h_{11}$ are both non-zero only in rows $\Delta_1^i$ and $\Delta_1^j$. In row $\Delta_1^i$, $z^{ij}$ is $d_1^j$ and column $h_{11}$ is $d_1^i$, and in row $\Delta_1^j$, $z^{ij}$ is $-d_1^i$ and column $h_{11}$ is $d_1^j$, so that the value of the product is $d_1^j d_1^i - d_1^i d_1^j = 0$, as desired.
Similarly, by (2.5.3) and (2.5.7) the only non-zero contributions to the product for column $h_{21}$ are

$$
\begin{array}{l|cccc}
\text{from row} & \Delta_1^i & \Delta_2^i & \Delta_1^j & \Delta_2^j \\ \hline
\text{in } z^{ij} & d_1^j & d_2^j & -d_1^i & -d_2^i \\
\text{in column } h_{21} & d_2^i & d_1^i & d_2^j & d_1^j
\end{array}
$$

which again totals zero, as desired.

The next claim is that the $z^{ij}$ are independent. Let $Z$ be the $nk \times \binom{k}{2}$ matrix with columns indexed by 2-subsets of $\{ 1, 2, \ldots, k \}$ whose $\{ i, j \}$th column is $z^{ij}$. It is necessary to show that $Z$ has full rank. Set $d^l = e^l$, the $l$th unit vector, so that $d_p^l = 1$ if $p = l$, and is 0 otherwise. Choosing $d^l$ in this way could only decrease the rank of $Z$. Let $\{ i, j \}$ be a 2-subset of $\{ 1, 2, \ldots, k \}$ and consider row $\Delta_j^i$ of $Z$. From (2.5.7), entry $\{ i, j \}$ of row $\Delta_j^i$ of $Z$ is $\pm 1$, and every other entry is zero. Thus the submatrix of $Z$ consisting of rows $\Delta_j^i$ for all subsets $\{ i, j \}$ is diagonal, and $Z$ has full rank.

The dimension of the null space of the rows of $A^{n,k}$ is therefore at least the number of columns of $Z$, namely $\binom{k}{2}$, and so the rank of $A^{n,k}$ can be at most $nk - \binom{k}{2}$. To show that rank $A^{n,k} \ge r_{n,k}$, consider the sparsity pattern

$$\begin{pmatrix} C & D \\ D^T & 0 \end{pmatrix}$$

where $C$ is a $k \times k$ dense matrix, and $D$ is a $k \times (n - k)$ dense matrix. It has $r_{n,k}$ unknowns. By choosing the $d^l$ as above, all of its unknowns are clearly determined by $k$ evaluations, and consequently the corresponding set of equations must have rank at least $r_{n,k}$. But the set of equations arising from such a sparsity pattern has a coefficient matrix which is a submatrix of $A^{n,k}$, and so rank $A^{n,k} = r_{n,k}$. □


2.5.2.  A  Bipartite  Lower  Bound  on  Evaluations 


In considering sparsity patterns with some (or even all) zero diagonal entries, it is possible to obtain a sharper lower bound than that of Theorem 2.5.1, by considering bipartite sparsity patterns, i.e., sparsity patterns whose associated graphs are bipartite.

A sparsity pattern is bipartite if and only if it has a principal permutation so that its structure looks like

$$\begin{pmatrix} 0 & C \\ C^T & 0 \end{pmatrix} \tag{2.5.8}$$

where $C$ is an $s \times t$ matrix. Such a Hessian can clearly be approximated by at most $\min(s, t)$ gradient evaluations, by differencing along either the first $s$ or the last $t$ unit vectors.

When the matrix $C$ in (2.5.8) is completely dense, call the coefficient matrix of the equations (2.5.1) $B^k$. As was the case with $A^{n,k}$, the maximum number of unknowns that can be determined by $k$ evaluations of a sparse bipartite Hessian is bounded above by the rank of $B^k$. The next theorem is the bipartite analogue of Theorem 2.5.1.


Theorem 2.5.2: The maximum number of unknowns of a sparse bipartite Hessian (as in (2.5.8)) that can be determined from a set of gradient evaluations along any $k$ directions, $0 \le k \le \min(s, t)$, is

$$r_{s,t,k} = (s + t)k - k^2.$$

In particular, when the matrix $C$ in (2.5.8) is completely dense, $\min(s, t)$ evaluations are needed to obtain all $st$ unknowns. This bound is sharp for some sparsity patterns.

Proof: This proof uses the same ideas as the row-oriented proof of Theorem 2.5.1. We shall show that rank $B^k = r_{s,t,k}$.

Denote the first $s$ indices of $H$ by $S1, S2, \ldots, Ss$, and the last $t$ indices by $T1, T2, \ldots, Tt$, so that $n = s + t$. The columns of $B^k$ are labelled with $h_{Si,Tj}$ for $i = 1, 2, \ldots, s$ and $j = 1, 2, \ldots, t$, and the rows with $\Delta_p^l$, $p = S1, S2, \ldots, Ss, T1, T2, \ldots, Tt$, $l = 1, 2, \ldots, k$. The entry in row $\Delta_p^l$, column $h_{Si,Tj}$ of $B^k$ corresponding to (2.5.3) equals

$$
\begin{cases}
d_{Tj}^l & \text{if } p = Si, \\
d_{Si}^l & \text{if } p = Tj, \\
0 & \text{otherwise.}
\end{cases}
\tag{2.5.9}
$$

For each $i, j \in \{ 1, 2, \ldots, k \}$, define an $nk$-vector $z^{ij}$ by

$$
z^{ij}_p =
\begin{cases}
d_{Sq}^j & \text{if } p = \Delta_{Sq}^i, \\
-d_{Tq}^i & \text{if } p = \Delta_{Tq}^j, \\
0 & \text{otherwise.}
\end{cases}
\tag{2.5.10}
$$

Note that this definition is consistent when $i = j$.

The first claim is that $z^{ij} B^k = 0$, which is verified for a typical column $h_{Su,Tv}$. From (2.5.9) and (2.5.10) the contributions to the product are

$$
\begin{array}{l|cc}
\text{from row} & \Delta_{Su}^i & \Delta_{Tv}^j \\ \hline
\text{in } z^{ij} & d_{Su}^j & -d_{Tv}^i \\
\text{in column } h_{Su,Tv} & d_{Tv}^i & d_{Su}^j
\end{array}
$$

and the total product is zero, as claimed.

The second claim is that the $z^{ij}$ are independent. As before, let $Z$ be the $nk \times k^2$ matrix whose columns are the $z^{ij}$, and set $d^l = e^{Sl} + e^{Tl}$, so that $d_p^l = 1$ if $p = Sl$ or $Tl$, and is 0 otherwise. Choosing these $d^l$ could only decrease the rank of $Z$. Let $i, j \in \{ 1, 2, \ldots, k \}$ where $i \le j$, say, and consider row $\Delta_{Sj}^i$ of $Z$. From (2.5.10) column $z^{ij}$ is $+1$ in row $\Delta_{Sj}^i$, and every other entry is zero. If $i > j$ consider row $\Delta_{Ti}^j$, where column $z^{ij}$ is $-1$ and all other entries are zero. Since this subset of rows picks out a diagonal submatrix of $Z$, it has full rank.

Because $Z$ has $k^2$ columns, rank $B^k \le nk - k^2$. Now consider the Hessian with bipartite sparsity pattern

$$
\begin{pmatrix}
0 & 0 & C & D \\
0 & 0 & E & 0 \\
C^T & E^T & 0 & 0 \\
D^T & 0 & 0 & 0
\end{pmatrix}
$$

where $C$ is dense and $k \times k$, $D$ is dense and $k \times (t - k)$, and $E$ is dense and $(s - k) \times k$. It has $k(s + t) - k^2$ unknowns. All of its unknowns can be approximated with only $k$ evaluations by using the $d^l$ defined above. By the same reasoning as in the proof of Theorem 2.5.1, it follows that rank $B^k = nk - k^2$. □


2.5.3.  Examples  of  Lower  Bounding 

We now give some examples of how Theorems 2.5.1 and 2.5.2 are used to calculate lower bounds for $\gamma(H)$. First consider dense band matrices, which have an unknown in entry $i,j$ if and only if $|i - j| < w$; $w$ is called the bandwidth of the matrix. For instance, when $n = 5$ and $w = 3$, the sparsity pattern looks like

$$\begin{pmatrix} X & X & X & 0 & 0 \\ X & X & X & X & 0 \\ X & X & X & X & X \\ 0 & X & X & X & X \\ 0 & 0 & X & X & X \end{pmatrix}.$$

Note that such a matrix has $r_{nw}$ unknowns. It is well-known (see Coleman and Moré (1981), Theorem 5.1) that dense band matrices can be approximated by using the difference directions

$$d^l = \sum_{i \equiv l \pmod{w}} e_i, \qquad l = 1, 2, \ldots, w.$$

In fact, these $d^l$ correspond to a SeqDC for dense band matrices. Since the number of evaluations equals the smallest $k$ for which the number of unknowns is at most $r_{nk}$, by Theorem 2.5.1 these sets of directions have the minimum possible cardinality for dense band matrices. Thus the bound of Theorem 2.5.1 is actually achieved for dense band matrices.

The bound of Theorem 2.5.1 can also be attained for the complete graphs without loops, which have unknowns at every entry $i,j$, except when $i = j$. The insights discussed here are due to oilman (1982). For instance, when $n = 3$, the sparsity pattern is

$$\begin{pmatrix} 0 & X & X \\ X & 0 & X \\ X & X & 0 \end{pmatrix}. \qquad (2.5.11)$$

It is known (see Powell and Toint (1979), equation (5.1)) that by differencing along $d^1 = (1, 1, 1)$, (2.5.11) can be approximated in only one evaluation. (Approximating the matrix (2.5.11) with this $d$ is an elimination method, and it is easy to see that any substitution method must use more than one evaluation. The matrix (2.5.11) seems to be the only example known, a point which is further explored later in this section.)

Let $H_n$ have the incidence matrix of the complete graph without loops on $n$ vertices as its sparsity pattern; it has $\binom{n}{2}$ unknowns. Suppose that $H_n$ can be optimally approximated by $\gamma_n$ gradient evaluations by an as yet unknown elimination method. Then $\gamma_{n+1} \le \gamma_n + 1$ since $H_{n+1}$ can surely be approximated by first differencing along $d^1 = e_{n+1}$ to get the last row and column of $H_{n+1}$, and then using the approximation scheme on the remaining unknowns, which have the same sparsity pattern as $H_n$.

Let $\lambda_n$ be the lower bound on $\gamma_n$ implied by Theorem 2.5.1, so that $\lambda_n = \min\{\, k : \binom{n}{2} \le nk - \binom{k}{2} \,\}$, which implies that $\lambda_n = \left\lceil \frac{1 + 2n - \sqrt{8n + 1}}{2} \right\rceil$. Since (2.5.11) is $H_3$, by the remarks above and the fact that $\lambda_n \le \gamma_n$, the first few values of $\lambda_n$ and $\gamma_n$ are

$$\begin{array}{c|cccccc} n & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \lambda_n & 0 & 1 & 1 & 2 & 3 & 3 \\ \gamma_n & 0 & 1 & 1 & 2 & 3 & ? \end{array}$$


We now show that $\gamma_6 = 3$. Arrange the vertices of $H_6$ in an array like:

        1
       2 3
      4 5 6

and difference along the three indicated triangles, i.e. along $d^1 = (1, 1, 1, 0, 0, 0)$, $d^2 = (0, 1, 0, 1, 1, 0)$ and $d^3 = (0, 0, 1, 0, 1, 1)$. By (2.5.11) these directions determine all the edges of $H_6$ in the triangles, i.e. edges 12, 13, 23, 24, ..., 56. The 123 triangle difference ($d^1$) gives $h_{15} + h_{25} + h_{35} = \Delta^1_5$ in row 5. Edges 25 and 35 are triangle edges, and so this equation determines edge 15 (and by symmetry, edges 34 and 26 are also determined). In row 4, the 123 triangle equation is $h_{14} + h_{24} + h_{34} = \Delta^1_4$. Edge 24 is a triangle edge, and it was shown above that edge 34 can also be determined, thus edge 14 (and so also 16 and 46) is determined. But all the edges in $H_6$ have now been determined, which implies that $\gamma_6 = 3$.
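The argument for six vertices can be checked mechanically. The following is a minimal sketch, not part of the original development: it assembles the linear equations that the three triangle directions impose on the 15 off-diagonal unknowns and computes the rank exactly over the rationals. The helper `rank` and the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        piv = next((r for r in range(rk, len(m)) if m[r][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            if m[r][c] != 0:
                f = m[r][c] / m[rk][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


n = 6
edges = list(combinations(range(1, n + 1), 2))  # the 15 unknowns h_ij, i < j
col = {e: c for c, e in enumerate(edges)}
directions = [(1, 1, 1, 0, 0, 0), (0, 1, 0, 1, 1, 0), (0, 0, 1, 0, 1, 1)]

# One equation per (direction, row p): sum over j != p of h_pj * d_j.
system = []
for d in directions:
    for p in range(1, n + 1):
        row = [0] * len(edges)
        for j in range(1, n + 1):
            if j != p:
                row[col[(min(p, j), max(p, j))]] = d[j - 1]
        system.append(row)

full_rank = rank(system) == len(edges)
print(rank(system), len(edges), full_rank)
```

Full column rank means the 18 equations determine all 15 unknowns, which is the content of the claim that three evaluations suffice.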

For $n > 6$ the inductive method that showed that $\gamma_{n+1} \le \gamma_n + 1$ can be used as long as $\lambda_{n+1} = \lambda_n + 1$. It can be shown that $\lambda_n = \lambda_{n-1} + 1$ fails if and only if $\frac{1 + 2n - \sqrt{8n + 1}}{2}$ is integral. This holds if and only if $8n + 1$ is a perfect odd square, which happens if and only if $n = \binom{k+1}{2}$ for some integer $k$. After $n = 6$, the next such $n$ is $n = 10 = \binom{4+1}{2}$. When $n$ has such a value, $\lambda_n = \lambda_{n-1} = \binom{k}{2}$, which is the next lower such $n$. Such integers are known as triangular numbers for the reason that when $n$ is triangular, $n$ points can be arranged in a triangular array similar to the configuration for $n = 6$ above.

Now it is inductively easy to show that when $n = \binom{k+1}{2}$, $\lambda_n = \gamma_n = \binom{k}{2}$ by using the $\binom{k}{2}$ triangular differences

          △
         △ △
        · · · ·
       △ △ △ △

The  induction  hypothesis  shows  that  every  edge  of  Hn  is  determined  by  these  differences  except  for  those 
involving  a  vertex  of  the  largest  triangle  and  a  point  on  the  opposite  edge.  But  these  can  be  determined 
from  the  other  edges  by  a  process  similar  to  that  described  above  for  n  =  6.  Thus  the  triangular  numbers 
are  doubly  triangular  in  this  context. 

For values of $n$ that are not triangular numbers, both $\lambda_n$ and $\gamma_n$ increase by one. Thus for all $n$, $\lambda_n = \gamma_n$, and so once again the implicit bound of Theorem 2.5.1 is attained.
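As a small numerical illustration (a sketch with hypothetical function names, not taken from the original), the bound for complete graphs without loops can be computed both directly from its defining inequality and from the closed form, and the values of $n$ at which it fails to increase can be listed.

```python
from math import comb, isqrt


def lam_bruteforce(n):
    """Smallest k with C(n,2) <= n*k - C(k,2)."""
    k = 0
    while comb(n, 2) > n * k - comb(k, 2):
        k += 1
    return k


def lam_closed(n):
    """ceil((1 + 2n - sqrt(8n + 1)) / 2), evaluated in integer arithmetic."""
    return (2 * n + 2 - isqrt(8 * n + 1)) // 2


first_six = [lam_closed(n) for n in range(1, 7)]
# Values of n at which the bound does not increase from n - 1 to n.
stalls = [n for n in range(2, 22) if lam_closed(n) == lam_closed(n - 1)]
triangular = [comb(k + 1, 2) for k in range(2, 7)]
print(first_six, stalls, triangular)
```

The stalls coincide with the triangular numbers in this range, in agreement with the discussion above.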

For the next example, consider the sparsity pattern

$$\begin{pmatrix} 0 & 0 & 0 & X & X & X \\ 0 & 0 & 0 & X & X & X \\ 0 & 0 & 0 & X & X & X \\ X & X & X & 0 & 0 & 0 \\ X & X & X & 0 & 0 & 0 \\ X & X & X & 0 & 0 & 0 \end{pmatrix} \qquad (2.5.12)$$

for which $n = 6$ and $\eta = 9$. Theorem 2.5.1 says that two evaluations can determine as many as $6 \cdot 2 - \binom{2}{2} = 11$ unknowns, and thus gives a lower bound of only 2. But since (2.5.12) is also a (complete) bipartite graph with $s = t = 3$, Theorem 2.5.2 gives a better lower bound of 3 (and also says that 3 evaluations are optimal). Example (2.5.12) illustrates that Theorem 2.5.2 can imply a higher bound than Theorem 2.5.1.

There is a way to sharpen further the bounds of Theorems 2.5.1 and 2.5.2. Consider the sparsity pattern

$$\begin{pmatrix} X & X & X & 0 \\ X & X & X & 0 \\ X & X & X & 0 \\ 0 & 0 & 0 & X \end{pmatrix} \qquad (2.5.13)$$

with $n = 4$ and $\eta = 7$. When $k = 2$, Theorem 2.5.1 concludes that as many as 7 unknowns could be calculated. But in approximating (2.5.13), the leading $3 \times 3$ submatrix must be approximated as well, and Theorem 2.5.1 implies that $k = 3$ evaluations are necessary to approximate a dense $3 \times 3$ matrix. Thus a better bound can be obtained in some cases from a submatrix than from the whole matrix.

For any (non-empty) $S \subseteq \{1, 2, \ldots, n\}$ the bound of Theorem 2.5.1 can be computed based on the submatrix of $H$ whose rows and columns are in $S$. The largest lower bound so computed is then a possibly sharper lower bound for $\gamma(H)$ than the bound based on the whole matrix.
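The two counting bounds and the submatrix refinement are easy to evaluate by brute force on small patterns. The sketch below is illustrative only: the function names are assumptions, a pattern is represented as a set of index pairs $(i, j)$ with $i \le j$, and the enumeration over all index subsets is exponential, so it is meant for tiny examples such as the two patterns discussed above.

```python
from itertools import combinations
from math import comb


def bound_251(n, eta):
    """Smallest k with eta <= n*k - C(k,2), the count in Theorem 2.5.1."""
    return next(k for k in range(n + 1) if eta <= n * k - comb(k, 2))


def bound_252(n, eta):
    """Smallest k with eta <= n*k - k^2, the count in Theorem 2.5.2 (bipartite)."""
    return next(k for k in range(n + 1) if eta <= n * k - k * k)


def best_submatrix_bound(n, unknowns):
    """Largest Theorem 2.5.1 bound over all principal submatrices."""
    best = 0
    for size in range(1, n + 1):
        for subset in combinations(range(1, n + 1), size):
            s = set(subset)
            eta = sum(1 for (i, j) in unknowns if i in s and j in s)
            best = max(best, bound_251(size, eta))
    return best


# Pattern with a dense leading 3 x 3 block plus one isolated diagonal entry.
block_plus_one = {(i, j) for i in range(1, 4) for j in range(i, 4)} | {(4, 4)}

print(bound_251(6, 9), bound_252(6, 9))          # complete bipartite, s = t = 3
print(bound_251(4, len(block_plus_one)),
      best_submatrix_bound(4, block_plus_one))   # whole matrix vs. best submatrix
```

For the second pattern the whole-matrix bound is 2 while the leading dense block forces 3, which is the sharpening described in the text.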

If a submatrix of $H$ corresponds to a bipartite graph, a lower bound should be computed using the sharper Theorem 2.5.2. Unfortunately, sometimes Theorem 2.5.2 can give a sharper result even when a submatrix does not correspond to a bipartite graph. Consider the sparsity pattern

$$\begin{pmatrix}
X & 0 & 0 & 0 & 0 & 0 & X & X & X & X & X & X \\
0 & X & 0 & 0 & 0 & 0 & X & X & X & X & X & X \\
0 & 0 & X & 0 & 0 & 0 & X & X & X & X & X & X \\
0 & 0 & 0 & X & 0 & 0 & X & X & X & X & X & X \\
0 & 0 & 0 & 0 & X & 0 & X & X & X & X & X & X \\
0 & 0 & 0 & 0 & 0 & X & X & X & X & X & X & X \\
X & X & X & X & X & X & X & 0 & 0 & 0 & 0 & 0 \\
X & X & X & X & X & X & 0 & X & 0 & 0 & 0 & 0 \\
X & X & X & X & X & X & 0 & 0 & X & 0 & 0 & 0 \\
X & X & X & X & X & X & 0 & 0 & 0 & X & 0 & 0 \\
X & X & X & X & X & X & 0 & 0 & 0 & 0 & X & 0 \\
X & X & X & X & X & X & 0 & 0 & 0 & 0 & 0 & X
\end{pmatrix} \qquad (2.5.14)$$

with $n = 12$ and $\eta = 48$. Since $5 \cdot 12 - \binom{5}{2} = 50 \ge 48$, Theorem 2.5.1 gives a best lower bound of $k = 5$ (even if checked for all submatrices). But by deleting the diagonal entries, the matrix becomes bipartite and Theorem 2.5.2 gives the better lower bound of $k = 6$.

Thus, to achieve the best lower bound derivable from Theorems 2.5.1 and 2.5.2, it is necessary to consider not just the vertex-induced subgraphs of $G$ (which correspond to submatrices of $H$), but all the edge-induced subgraphs as well. The edge-induced submatrices must be checked for bipartiteness to see whether Theorem 2.5.2 can give a higher bound.
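To make the bipartiteness check concrete, here is a hedged sketch (the helper `is_bipartite` and the variable names are assumptions) that builds a 12-index pattern with an unknown diagonal and a complete set of unknowns joining indices 1 to 6 with indices 7 to 12, tests two-colourability by breadth-first search, and evaluates both counting bounds.

```python
from collections import deque
from math import comb

n = 12
# Unknown diagonal plus every entry joining {1..6} to {7..12}, stored with i <= j.
unknowns = {(i, i) for i in range(1, n + 1)}
unknowns |= {(i, j) for i in range(1, 7) for j in range(7, n + 1)}


def is_bipartite(n, edges):
    """Two-colour the graph by breadth-first search; a loop makes it fail."""
    adj = {v: [] for v in range(1, n + 1)}
    for i, j in edges:
        if i == j:
            return False
        adj[i].append(j)
        adj[j].append(i)
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in colour:
                    colour[w] = 1 - colour[v]
                    queue.append(w)
                elif colour[w] == colour[v]:
                    return False
    return True


off_diagonal = {e for e in unknowns if e[0] != e[1]}  # edge-induced subgraph
k_general = next(k for k in range(n + 1)
                 if len(unknowns) <= n * k - comb(k, 2))
k_bipartite = next(k for k in range(n + 1)
                   if len(off_diagonal) <= n * k - k * k)
print(len(unknowns), is_bipartite(n, unknowns), is_bipartite(n, off_diagonal))
print(k_general, k_bipartite)
```

Dropping the diagonal unknowns is an edge-induced restriction, not a vertex-induced one, which is why the vertex-induced search alone misses the higher bound.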


2.5.4. Computing Lower Bounds in Theory

Next we consider how to compute the bounds implicit in Theorems 2.5.1 and 2.5.2. In principle, as was shown in the proofs of Theorem 2.5.1, $\gamma(H)$ can be computed by the following algorithm.

Algorithm $\gamma$:

0. Set $k = 1$.

1. Construct the matrix $A^{n,k}$, using, say, columns of a Vandermonde matrix for the $d^l$.

2. Delete the columns of $A^{n,k}$ corresponding to entries of $H$ known to be zero, yielding $\bar{A}^{n,k}$.

3. Calculate $\operatorname{rank}(\bar{A}^{n,k}) = r^k$. If $r^k \ge \eta$, then $\gamma(H) = k$. Stop.

4. Otherwise, set $k \leftarrow k + 1$ and go to 1.

With infinite-precision arithmetic, Algorithm $\gamma$ is a polynomial algorithm, since Step 3 can be performed at most $n$ times on a matrix whose size is bounded by $n^2$. Calculating rank is an $O(\text{size}^3)$ operation, yielding a total time bound of $O(n^7)$. However in practice, with finite-precision arithmetic, the decision in Step 3 is not clear cut. Calculating the rank of any numerical matrix is extremely difficult in practice (see Peters and Wilkinson (1970)), so much so that the conventional wisdom among numerical analysts is that numerical rank cannot be precisely defined. Even if it could be, the well-known classes of matrices guaranteed to satisfy the Haar condition (such as the Vandermonde matrices) are notoriously ill-conditioned and difficult to work with (see Newsam and Ramsdell (1981), p. 13 for similar concerns in the context of Jacobian approximation). Thus Algorithm $\gamma$ would not be practical even if rank were calculable.

A more practical implementation of Algorithm $\gamma$ is to use randomly generated vectors for the $d^l$ in Step 1. With probability nearly 1, a random matrix satisfies the Haar condition, and in practice the resulting matrix $\bar{A}^{n,k}$ is usually well enough conditioned that Step 3 can be carried out satisfactorily in finite-precision arithmetic. By repeating the randomized Algorithm $\gamma$ several times using different random $d^l$, a high degree of confidence in the answer can be obtained. Indeed just such an algorithm has been implemented (in order to search for a counterexample to a conjecture that comes later), and it has performed satisfactorily on small problems with $n \le 12$.
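The randomized procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation referred to in the text: it draws random nonzero integer directions and computes ranks exactly with rational arithmetic, which sidesteps the finite-precision rank decision at the cost of speed. The function names, the `seed` and `spread` parameters, and the representation of a pattern as a set of pairs $(i, j)$ with $i \le j$ are all hypothetical.

```python
import random
from fractions import Fraction


def rank(rows):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        piv = next((r for r in range(rk, len(m)) if m[r][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            if m[r][c] != 0:
                f = m[r][c] / m[rk][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def gamma_randomized(n, unknowns, seed=0, spread=10**6):
    """Smallest k for which k random directions give full column rank."""
    cols = sorted(unknowns)
    if not cols:
        return 0
    col = {e: c for c, e in enumerate(cols)}
    rng = random.Random(seed)
    system = []
    for k in range(1, n + 1):
        d = [rng.randint(1, spread) for _ in range(n)]  # one new direction
        for p in range(1, n + 1):
            row = [0] * len(cols)
            for j in range(1, n + 1):
                e = (min(p, j), max(p, j))
                if e in col:
                    row[col[e]] = d[j - 1]
            system.append(row)
        if rank(system) >= len(cols):
            return k
    return n  # n unit directions always suffice


triangle = {(1, 2), (1, 3), (2, 3)}                        # pattern (2.5.11)
dense3 = {(i, j) for i in range(1, 4) for j in range(i, 4)}
bipartite33 = {(i, j) for i in range(1, 4) for j in range(4, 7)}
print(gamma_randomized(3, triangle),
      gamma_randomized(3, dense3),
      gamma_randomized(6, bipartite33))
```

Because a single random draw can in principle land on a degenerate set of directions, repeating the call with several seeds, as the text suggests, gives more confidence in the reported value.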

The randomized Algorithm $\gamma$ has some drawbacks. First, it becomes increasingly slow to compute the LU-factorization that is used to calculate the rank of the $nk \times \eta$ matrices. Second, one has less and less faith in the computed answer as $n$ gets large, due to the usual numerical difficulties in computing rank mentioned above, coupled with a smaller degree of confidence that the random matrices satisfy the Haar condition to the tolerance of the computer. Third, the whole procedure is esthetically unsatisfying to mathematical sensibilities because the number $\gamma(H)$ is an intrinsic characteristic of $H$, and calculating ranks of random matrices seems to be a roundabout way of computing it. A purely combinatorial algorithm would seem much more appropriate.

Unfortunately such an algorithm does not now exist. However, we make the following two conjectures:

Conjecture 2.5.3: $\gamma(H)$ is equal to the largest lower bound computable via subgraphs using Theorems 2.5.1 and 2.5.2.

Conjecture 2.5.4: There is a purely combinatorial polynomial algorithm for computing $\gamma(H)$ that is fast in practice.

The evidence for Conjecture 2.5.3 is that $\gamma(H)$ has been computed for many small $H$ ($n \le 12$) by using the randomized Algorithm $\gamma$, and no counterexample has been found. Also, the conjecture is true for all the examples discussed above. That is, every $H$ tried so far does have some subgraph $K$ such that $\gamma(H)$ is the smallest $k$ for which the bound of either Theorem 2.5.1 or 2.5.2 is satisfied on $K$.

The evidence for Conjecture 2.5.4 is stronger than a lack of counterexamples. There already exists an almost practically effective way to compute $\gamma(H)$, namely the randomized version of Algorithm $\gamma$. The fact that $\gamma(H)$ can be easily computed for many $H$ appears to be inconsistent with any supposition that finding $\gamma(H)$ is NP-Complete. A striking characteristic of most NP-Complete problems is that they are no easier in practice than they are in theory. Thus the existence of the randomized Algorithm $\gamma$ seems strong evidence for Conjecture 2.5.4.

2.5.5.  Computing  Lower  Bounds  in  Practice 

If Conjectures 2.5.3 and 2.5.4 are both true, it would follow that there is a (practical) combinatorial polynomial algorithm for computing $\lambda(H)$, the largest lower bound implied by Theorems 2.5.1 and 2.5.2 over all subgraphs of $H$. Some preliminary work is presented next on how to compute $\lambda(H)$, which can also be taken as evidence for both conjectures.

For simplicity, assume at first that $H$ is bipartite; then all of its edge-induced subgraphs are also bipartite, so that only the bound of Theorem 2.5.2 is relevant. Let $E$ be any subset of the unknowns of $H$ (or of the edges of its graph), and let $N(E) = \{\, i \mid h_{ij} \in E, \text{ some } j \,\}$. For example, if

$$H = \begin{pmatrix} 0 & 0 & X & X & X \\ 0 & 0 & X & X & X \\ X & X & 0 & 0 & 0 \\ X & X & 0 & 0 & 0 \\ X & X & 0 & 0 & 0 \end{pmatrix}$$

and $E = \{ h_{13}, h_{14}, h_{15} \}$, then $N(E) = \{1, 3, 4, 5\}$.

In the subgraph of $H$ determined by $E$, Theorem 2.5.2 says that $k$ evaluations might suffice for the $|E|$ unknowns if $|E| \le |N(E)| \cdot k - k^2$. Thus the largest lower bound over all $E$ must be

$$\min\{\, k \mid |E| \le |N(E)| \cdot k - k^2 \text{ for all } E \,\}. \qquad (2.5.15)$$

To see why, denote the minimizing $k$ in (2.5.15) by $k^*$, and let $E^*$ be an edge subset that blocks $k^*$ from being smaller, so that $k^* = \min\{\, k \mid |E^*| \le |N(E^*)| \cdot k - k^2 \,\}$. But this is the definition of the lower bound on $\gamma(H)$ derivable from $E^*$, so that $k^*$ is a lower bound on $\gamma(H)$. For any other edge subset $E$, $|E| \le |N(E)| \cdot k^* - (k^*)^2$, which implies that the lower bound derivable from $E$ is at most $k^*$.
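For very small bipartite patterns the largest bound derivable from any single edge subset can be found by exhaustive enumeration. The sketch below is an assumption-laden illustration (function names are hypothetical): for each non-empty edge subset $E$ it takes the smallest $k$ with $|E| \le |N(E)| \cdot k - k^2$, which is the bound the text associates with $E$, and then maximizes over $E$. The enumeration is exponential in the number of unknowns.

```python
from itertools import combinations


def incident_indices(E):
    """N(E): every index that appears in some unknown of E."""
    return {v for e in E for v in e}


def bound_from_subset(E):
    """Smallest k with |E| <= |N(E)|*k - k^2 (bipartite E, no loops)."""
    size = len(incident_indices(E))
    return next(k for k in range(1, size + 1)
                if len(E) <= size * k - k * k)


def lambda_bruteforce(unknowns):
    """Largest bound derivable from any non-empty edge subset."""
    edges = sorted(unknowns)
    return max(bound_from_subset(E)
               for r in range(1, len(edges) + 1)
               for E in combinations(edges, r))


# Complete bipartite patterns on index sets {1,2} | {3,4,5} and {1,2,3} | {4,5,6}.
k23 = {(i, j) for i in (1, 2) for j in (3, 4, 5)}
k33 = {(i, j) for i in (1, 2, 3) for j in (4, 5, 6)}

example_E = [(1, 3), (1, 4), (1, 5)]
print(sorted(incident_indices(example_E)), bound_from_subset(example_E))
print(lambda_bruteforce(k23), lambda_bruteforce(k33))
```

The rest of this section is concerned with replacing this exponential enumeration by a polynomial number of network flow computations.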

Equation (2.5.15) is reminiscent of Philip Hall-type theorems (see, e.g., Welsh (1976), pp. 97-98). Let $B$ be a bipartite graph with left vertices $S$ and right vertices $T$. For $U \subseteq S$, let $\Gamma(U) = \{\, t \in T \mid \{s, t\} \text{ is an edge of } B \text{ for some } s \in U \,\}$. The Philip Hall theorem of interest is

Theorem 2.5.5: Let $m = |S|$ and $d \ge 0$. Then $B$ has a matching of size $m - d$ if and only if

$$|U| \le |\Gamma(U)| + d, \quad \text{for all } U \subseteq S. \qquad (2.5.16)$$

(See Welsh (1976), Theorem 7.4.) □




Theorem 2.5.5 is interesting because it provides a polynomial algorithm that simultaneously verifies an exponential number of inequalities. That is, verifying the $2^m$ inequalities in (2.5.16) is equivalent by Theorem 2.5.5 to finding out whether $B$ has a matching of size at least $m - d$; since good polynomial algorithms exist for finding a maximum cardinality matching (see Lawler (1976), Chapter 5), the equivalence implies that there is a polynomial algorithm for verifying the inequalities (2.5.16).
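The equivalence can be illustrated on a toy graph. The following sketch is not from the original: it compares an exhaustive check of the Hall-type inequalities with a maximum matching found by a simple augmenting-path search. The function names and the example graph are assumptions chosen for illustration.

```python
from itertools import combinations


def max_matching(left, adj):
    """Size of a maximum matching, by repeated augmenting-path search."""
    match = {}  # right vertex -> left vertex currently matched to it

    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return sum(augment(u, set()) for u in left)


def hall_holds(left, adj, d):
    """Check |U| <= |Gamma(U)| + d for every subset U of the left vertices."""
    for r in range(len(left) + 1):
        for U in combinations(left, r):
            neighbours = set().union(*(adj[u] for u in U))
            if len(U) > len(neighbours) + d:
                return False
    return True


# Toy graph: 'a' and 'b' compete for right vertex 1; 'c' can also use 2.
left = ['a', 'b', 'c']
adj = {'a': [1], 'b': [1], 'c': [1, 2]}
m = len(left)
size = max_matching(left, adj)
agreement = [hall_holds(left, adj, d) == (size >= m - d) for d in range(m + 1)]
print(size, agreement)
```

The subset check examines all $2^m$ subsets, while the matching computation is polynomial; the theorem says they always give the same verdict.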

It is simple to generalize Theorem 2.5.5 slightly when the inequalities to be verified are

$$|U| \le |\Gamma(U)| \cdot k + d \quad \text{for all } U \subseteq S. \qquad (2.5.17)$$

Recall that the maximum cardinality matching problem on $B$ is equivalent to a network flow problem $N$ as follows (see Lawler (1976), Section 5.2). Direct each $S$, $T$ edge from $S$ to $T$ with capacity $\infty$, add an arc $(s, i)$ with capacity 1 for all $i \in S$, and add an arc $(j, t)$ with capacity 1 for all $j \in T$. Then $N$ has a flow of value $m - d$ if and only if $B$ has a matching of size $m - d$.

Now change the capacity of each $(j, t)$ arc from 1 to $k$ (which corresponds to "multiplying" each $T$ vertex of $B$ by $k$). Theorem 2.5.5 becomes

Theorem 2.5.6: The $2^m$ inequalities (2.5.17) are true if and only if $N$ has a flow of value $m - d$. □

Since network flow also has a polynomial algorithm (see Lawler (1976), Chapter 4), there is a fast way of verifying (2.5.17) as well.

Now suppose that the $2^m - 1$ inequalities

$$|U| \le |\Gamma(U)| \cdot k - d \quad \text{for all } \emptyset \ne U \subseteq S \qquad (2.5.18)$$

could also be verified in polynomial time for any $d \ge 0$. Consider the bipartite graph $B$ which has $S = \{\text{unknowns of } H\}$, $T = \{1, 2, \ldots, n\}$ and edge $\{h_{ij}, l\}$ when $l = i$ or $j$. Then for $E \subseteq S$, $\Gamma(E) = N(E)$. Hence, by setting $d = k^2$ and iterating the hypothetical procedure for $k = n, n - 1, \ldots$, the minimizing $k$ in (2.5.15) could be determined. This line of reasoning makes it interesting to find a polynomial algorithm to solve (2.5.18) (an apparently minor variant of (2.5.17)).

Such an algorithm has been provided by Saks and Kahn (1985). Consider the network $N$ of Theorem 2.5.6, and let $N_i$ denote $N$ with the capacity of the single arc $(s, i)$ changed from 1 to $\infty$. By the usual argument, a minimum cut for $N_i$ must be of the form $\{s\} \cup U \cup \Gamma(U)$ for some $U \subseteq S$. Since the capacity of $(s, i)$ is $\infty$, $i$ must be in $U$. Thus the minimum cut, and so the maximum flow, for $N_i$ solves the problem

$$\min_{i \in U \subseteq S} \; m - |U| + k \cdot |\Gamma(U)|. \qquad (2.5.19)$$

Solve each of the maximum flow/minimum cut problems $N_i$, and let $U^*$ be a minimizer in (2.5.19) for an $N_i$ with the smallest capacity minimum cut. Now, if (2.5.18) is satisfied, then certainly $m + d \le |\Gamma(U^*)| \cdot k - |U^*| + m$ = smallest value of any $N_i$ flow. Conversely, suppose that

$$m + d \le |\Gamma(U^*)| \cdot k - |U^*| + m. \qquad (2.5.20)$$

By definition of $U^*$, the right-hand side of (2.5.20) is less than or equal to $|\Gamma(U)| \cdot k - |U| + m$ for all non-empty $U \subseteq S$, and so (2.5.18) is satisfied.

The problem of verifying the $2^m - 1$ inequalities in (2.5.18) has been reduced to solving $m$ network flow problems, and so can be done in polynomial time. Let $V$ be the set of nodes of $N_i$, and $E$ the set of its arcs. Then the complexity of solving one $N_i$ maximum flow problem is $O(|V||E| \log |V|)$ (see Papadimitriou and Steiglitz (1982), Chapter 9). Since verifying (2.5.18) involves $m$ problems, it is of complexity $O(m|V||E| \log |V|)$.

For the problem of interest, $|S| = m = \eta$, $|V| = |S \cup T| = n + \eta$, and $|E| = n + 3\eta$. The procedure iterates at most $n$ times, for a total time complexity of $O(n\eta(n + \eta)^2 \log(n + \eta))$. When $\eta = O(n)$, the complexity reduces to $O(n^4 \log n)$.

To recapitulate, we have shown that when $H$ is bipartite, the following algorithm exactly calculates $\lambda(H)$:

Algorithm B:

0. Set $k = \lfloor n/2 \rfloor - 1$.

1. For $i$ = each unknown in $H$ do

   Construct network $N_i$.

   Solve maximum flow on $N_i$, with value $v_i$.

2. Let $v^* = \min_i v_i$.

3. If $\eta + k^2 \le v^*$, set $k \leftarrow k - 1$, go to 1.

4. Else answer $\lambda(H) = k + 1$.

The existence of Algorithm B is certainly consistent with the truth of Conjectures 2.5.3 and 2.5.4. When $H$ is not bipartite, it is easy to see that the best bound given by Theorem 2.5.1 can be calculated by replacing Step 3 of Algorithm B by "If $\eta + \binom{k}{2} \le v^*$ ...", since the only difference between the conclusions of Theorems 2.5.1 and 2.5.2 is the substitution of $\binom{k}{2}$ for $k^2$. Unfortunately, as example (2.5.14) shows, the answer resulting from the modified Algorithm B is not in general equal to $\lambda(H)$, even when all the diagonal elements of $H$ are unknowns. Resolving this difficulty is an area for more research.


2.5.6.  A  Bound  for  Higher-Order  Derivatives 

A question that naturally arises is how this research extends to approximating higher-order derivatives.
Such  an  extension  is  not  a  practical  concern,  for  storing  and  working  with  a  moderately  large  order-3 
derivative  array,  even  if  it  is  sparse,  would  be  prohibitively  expensive.  Nevertheless,  a  mathematical  sense 
of  completion  can  make  such  extensions  interesting.  The  urge  to  generalize  has  been  resisted  in  most  of 
the  rest  of  this  thesis  (except  perhaps  in  the  first  proof  of  Theorem  2.4.1),  but  we  shall  yield  to  it  here  and 
indicate  how  to  generalise  Theorem  2.5.1  to  higher-order  derivatives. 

Let us review the definition of a higher-order derivative. For a function $F: R^n \to R$, its derivative of order $m$, $\nabla^m F$, evaluated at point $x^0$, is the $n \times n \times \cdots \times n$ ($m$ times) array of numbers

$$\nabla^m F(x^0) = (h_{i_1 i_2 \ldots i_m}), \quad \text{where} \quad h_{i_1 i_2 \ldots i_m} = \frac{\partial^m F(x^0)}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_m}}.$$

By symmetry of repeated derivatives, $h_{i_1 i_2 \ldots i_m} = h_{\pi(i_1) \pi(i_2) \ldots \pi(i_m)}$ for any $m$-permutation $\pi$. Thus, the number of potentially different entries in $\nabla^m F(x^0)$ is the cardinality of the set $I_{nm} = \{\, (i_1, i_2, \ldots, i_m) \mid 1 \le i_1 \le i_2 \le \cdots \le i_m \le n \,\}$, the set of $m$-selections from $\{1, 2, \ldots, n\}$ (they are selections instead of subsets since they can have repeated entries).

Let $P_{kl} = \{\, (i_1, i_2, \ldots, i_l) \mid 1 \le i_1 < i_2 < \cdots < i_l \le k \,\}$, the set of $l$-subsets of $\{1, 2, \ldots, k\}$, so that $|P_{kl}| = \binom{k}{l}$. Then the bijection between elements of $I_{nm}$ and $P_{n+m-1,m}$ given by

$$(i_1, i_2, \ldots, i_m) \in I_{nm} \;\mapsto\; (i_1, i_2 + 1, \ldots, i_m + m - 1) \in P_{n+m-1,m}$$

(see Knuth (1973), Section 1.2.6, exercise 60) shows that $|I_{nm}| = \binom{n+m-1}{m}$, so that the number of different unknowns in a dense $\nabla^m F(x^0)$ is also $\binom{n+m-1}{m}$ (a special case is that when $m = 2$, the number of unknowns in a dense Hessian is $\binom{n+1}{2}$, as already discussed).
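The bijection is easy to check by enumeration for small parameters. The following sketch (names are illustrative assumptions) lists all $m$-selections, shifts the $r$-th component by $r - 1$, and confirms that the images are exactly the $m$-subsets of $\{1, \ldots, n + m - 1\}$.

```python
from itertools import combinations, combinations_with_replacement
from math import comb


def shift(selection):
    """Map (i1 <= i2 <= ... <= im) to (i1, i2 + 1, ..., im + m - 1)."""
    return tuple(i + offset for offset, i in enumerate(selection))


n, m = 4, 3
selections = list(combinations_with_replacement(range(1, n + 1), m))
subsets = set(combinations(range(1, n + m), m))  # m-subsets of {1..n+m-1}
images = {shift(s) for s in selections}

print(len(selections), comb(n + m - 1, m), images == subsets)
```

Since the shift is injective and its image is the full set of subsets, the two collections have the same size.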

Let $H^m = \nabla^m F(x^0)$. In approximating $H^m$ by finite-differencing $\nabla^{m-1} F$ along directions $d^l$, the approximation $\hat{H}^m$ satisfies the linear equations

$$\sum_{i_1 = 1}^{n} \hat{h}_{i_1 i_2 \ldots i_m} d^l_{i_1} = \left( \nabla^{m-1} F(x^0 + d^l) - \nabla^{m-1} F(x^0) \right)_{i_2 i_3 \ldots i_m}, \qquad (2.5.21)$$

$1 \le i_j \le n$, $1 \le l \le k$. Each $d^l$ would appear to give rise to $n^{m-1}$ equations in (2.5.21), namely one for each different $i_2, i_3, \ldots, i_m$, but since $\nabla^{m-1} F$ is also symmetric, these equations are identical under permutation of $i_2, i_3, \ldots, i_m$. Thus it can be assumed without loss of generality that $i_2 \le i_3 \le \cdots \le i_m$, so that each $d^l$ actually gives rise to $|I_{n,m-1}|$ equations in (2.5.21). It follows that (2.5.21) is a set of $k \binom{n+m-2}{m-1}$ equations in $\binom{n+m-1}{m}$ unknowns.

The usual approximation method would again choose $d^l = e_l$. It is easy to see that $k$ evaluations of $\nabla^{m-1} F$ with these $d^l$ (directly) determine every entry of $\hat{H}^m$ whose smallest subscript is at most $k$. If an entry of $\hat{H}^m$ has all its subscripts greater than $k$, then its subscripts are effectively an $m$-selection from the set $\{k + 1, k + 2, \ldots, n\}$ of cardinality $n - k$. Thus $\binom{n-k+m-1}{m}$ out of the $\binom{n+m-1}{m}$ total unknowns are not determined by the $k$ evaluations along the unit vectors, so that $\binom{n+m-1}{m} - \binom{n-k+m-1}{m}$ unknowns are determined. The proof of Theorem 2.5.1 leads to the conjecture that at most this number of unknowns can be determined by general $d^l$ as well, and indeed the following theorem is the proper generalization of Theorem 2.5.1.

Theorem 2.5.8: When approximating $\nabla^m F$ (possibly with sparsity conditions) by finite-differencing $\nabla^{m-1} F$ along directions $d^1, d^2, \ldots, d^k$, at most

$$\binom{n+m-1}{m} - \binom{n-k+m-1}{m}$$

unknowns can be determined by the $k$ evaluations of $\nabla^{m-1} F$.

In particular, $n$ evaluations are necessary to approximate $\nabla^m F$ when it is dense. This bound is tight for some sparsity patterns.

Sketch of Proof: This proof is very much in the spirit of the row-oriented proof of Theorem 2.5.1. Let $A^{nmk}$ be the $k\binom{n+m-2}{m-1} \times \binom{n+m-1}{m}$ coefficient matrix of equations (2.5.21). As was the case in Theorem 2.5.1, it suffices to show that $\operatorname{rank} A^{nmk} = r_{nmk}$ when $H$ is completely dense. A key observation in the proof is the identity

$$\binom{n+m-1}{m} - \binom{n-k+m-1}{m} = \sum_{i=1}^{m} (-1)^{i-1} \binom{n+m-i-1}{m-i} \binom{k}{i}, \qquad (2.5.22)$$

of which (2.5.4) is a specialization. This identity is easy to prove by induction.

A sequence of matrices $A^{nmk} = Z^1, Z^2, \ldots, Z^m$ can be defined with entries from the $d^l$, where $Z^i$ is $\binom{n+m-i-1}{m-i}\binom{k}{i} \times \binom{n+m-i}{m-i+1}\binom{k}{i-1}$. Note that the row size of $Z^i$ is the absolute value of the $i$th term of (2.5.22), and that the column size of $Z^i$ equals the row size of $Z^{i-1}$. The property that the $Z^i$ are constructed to satisfy is that $Z^i Z^{i-1} = 0$, $i = 2, 3, \ldots, m$, so that each row of $Z^i$ is in the null space of the columns of $Z^{i-1}$.

Now set $d^l = e_l$, which can only decrease the rank of each $Z^i$. As the base of an induction, using these particular $d^l$ it can be shown that $Z^m$ has full row rank, namely $\binom{k}{m}$. Thus, since $Z^{m-1}$ has $n\binom{k}{m-1}$ rows and $Z^m$ is in its null space, $\operatorname{rank} Z^{m-1} \le n\binom{k}{m-1} - \binom{k}{m}$. But with these $d^l$ a square, diagonal submatrix of $Z^{m-1}$ of size $n\binom{k}{m-1} - \binom{k}{m}$ can be found, and hence $\operatorname{rank} Z^{m-1} = n\binom{k}{m-1} - \binom{k}{m}$, the last two terms of (2.5.22).

At the general step it is known that $\operatorname{rank} Z^{i+1} = -\sum_{l=i+1}^{m} (-1)^{l-i} \binom{n+m-l-1}{m-l}\binom{k}{l}$ and that $Z^i$ has $\binom{n+m-i-1}{m-i}\binom{k}{i}$ rows, so that

$$\operatorname{rank} Z^i \le \sum_{l=i}^{m} (-1)^{l-i} \binom{n+m-l-1}{m-l} \binom{k}{l}. \qquad (2.5.23)$$

But then there is a square submatrix of $Z^i$ of size (2.5.23) which is diagonal with this choice of $d^l$, hence (2.5.23) is really an equality, and the induction can proceed. The induction terminates at $i = 1$ which yields

$$\operatorname{rank} Z^1 = \operatorname{rank} A^{nmk} = -\sum_{l=1}^{m} (-1)^{l} \binom{n+m-l-1}{m-l} \binom{k}{l} = \binom{n+m-1}{m} - \binom{n-k+m-1}{m}$$

as desired. □
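The counting bound for order-$m$ derivatives, and the alternating-sum form of it used in the proof sketch, can be spot-checked numerically. This is a hedged illustration with hypothetical function names: it compares the difference of binomial coefficients with the alternating sum $\sum_{i=1}^{m} (-1)^{i-1} \binom{n+m-i-1}{m-i} \binom{k}{i}$ over a small range, confirms that $m = 2$ recovers $nk - \binom{k}{2}$, and confirms that a dense array needs $k = n$.

```python
from math import comb


def determined_bound(n, m, k):
    """C(n+m-1, m) - C(n-k+m-1, m): most unknowns k evaluations can fix."""
    return comb(n + m - 1, m) - comb(n - k + m - 1, m)


def alternating_sum(n, m, k):
    """Sum over i = 1..m of (-1)^(i-1) * C(n+m-i-1, m-i) * C(k, i)."""
    return sum((-1) ** (i - 1) * comb(n + m - i - 1, m - i) * comb(k, i)
               for i in range(1, m + 1))


cases = [(n, m, k) for n in range(1, 8) for m in range(1, 6)
         for k in range(0, n + 1)]
identity_ok = all(determined_bound(n, m, k) == alternating_sum(n, m, k)
                  for n, m, k in cases)
hessian_ok = all(determined_bound(n, 2, k) == n * k - comb(k, 2)
                 for n in range(1, 8) for k in range(0, n + 1))
# Fewest evaluations for which the bound reaches the dense unknown count.
dense_needs = {(n, m): next(k for k in range(n + 1)
                            if determined_bound(n, m, k) == comb(n + m - 1, m))
               for n in range(1, 8) for m in range(1, 6)}
print(identity_ok, hessian_ok, all(dense_needs[n, m] == n for n, m in dense_needs))
```

A finite check of this kind does not replace the inductive proof, but it guards against transcription slips in the binomial indices.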


It  would  be  mathematically  interesting  to  obtain  a  similar  generalization  of  Theorem  2.5.2. 


2.6. Reflections on Sparse Hessians

We have investigated the approximation of sparse Hessians, largely from the point of view of computational complexity. Sections 2.2 and 2.3 showed that various methods that have been proposed for approximating sparse Hessians can be classified in a helpful way.

Using this classification, Section 2.4 showed that finding an optimal approximation scheme for each subclass of direct methods is an NP-Complete problem. The theorem of Coleman and Moré (1982), which is quoted as Theorem 2.4.9, gives a similar NP-Completeness result for a subclass of substitution methods. Their theorem suggests that as with direct methods, all subclasses of substitution methods are NP-Complete. Fully investigating the complexity of substitution methods is a useful topic for future research.

By contrast, the results of Section 2.5 tend to support the view that the minimum number of gradient evaluations needed by a given sparsity pattern for an elimination method can be computed by a polynomial algorithm (however, computing a set of numerically "reasonable" difference directions that realizes the minimum may be much harder). Thus in passing through the spectrum of approximation methods from direct methods (simple, stable, large number of evaluations) to elimination methods (complicated, possibly unstable, smallest number of evaluations), a boundary between NP-Completeness and polynomial algorithms seems to be crossed between the substitution methods and the elimination methods. Clearly more work needs to be done to establish the truth or falsity of Conjectures 2.5.3 and 2.5.4.

An intriguing additional reason to study Conjectures 2.5.3 and 2.5.4 is that very few examples are known of sparse Hessians where an optimal elimination method uses strictly fewer gradient evaluations than an optimal substitution method. The standard (and apparently, essentially the only) example of this phenomenon is (2.5.11) (though it is likely that all complete graphs without loops also fall in this class). It is not clear whether such examples are inherently rare, or whether there has been insufficient work in constructing them. If such examples are rare, then establishing the truth or falsity of Conjectures 2.5.3 and 2.5.4 is even more important since the ability to compute $\gamma(H)$ for a substitution method would be a valuable guide for a substitution heuristic. On the other hand, if such examples are common, being able to compute $\gamma(H)$ might aid in searching for them.

Though  practicalities  have  been  mentioned  along  the  way,  our  emphasis  has  been  on  complexity  rather 
than  computation.  Thus  the  reader  may  still  be  uncertain  as  to  what  method  to  choose  to  approximate  a 
sparse Hessian. The direct methods reported in Coleman and Moré (1982) are simple, numerically stable,
and very fast (both in finding the groups and in approximating H given the groups), and empirically give
fairly good results (see their Table 4.1). Substitution methods are inherently less numerically stable than
direct methods, though Powell and Toint (1979) show that the accumulated error in a substitution method
cannot grow too fast. The triangular substitution methods in Coleman and Moré (1982) are almost as
simple as their direct methods, reasonably numerically stable, fast in finding the groups, somewhat slower
in solving for H given the groups, but empirically use significantly fewer gradient evaluations than their
direct heuristics (compare their Tables 4.1 and 8.1). To our knowledge, no general elimination methods have
been proposed. Except in special cases, such as the complete graphs without loops mentioned in Section 2.5
(which effectively use a substitution method except for solving systems like (2.5.11)), elimination methods
are expected to have such potentially unreliable numerical properties as to make them practically useless.
At this point, the triangular substitution heuristics in Coleman and Moré (1982) seem to be the best for
general usage.

This  chapter  has  resolved  many  of  the  previously  open  questions  about  approximating  sparse  Hessians. 
However,  much  work  remains  before  the  subject  is  completely  understood. 




Chapter  3 


Making  Sparse  Matrices  Sparser 


3.1.  Introduction  to  Making  Matrices  Sparser 

Many  large-scale  constrained  optimisation  problems  are  of  the  form 

$$\min F(x) \quad \text{s.t.} \quad Ax = b, \quad l \le x \le u, \qquad (3.1.1)$$

where $l, u, x \in R^n$, $F : R^n \to R$ and A is an $m \times n$ matrix. This is a linearly constrained problem with
bounded variables. Usually m is less than n, and hence there are many x that satisfy Ax = b.

Quite  large  problems  of  the  form  (3.1.1)  have  been  solved,  some  with  m  >  10,000,  n  >  50,000  (see, 
e.g., Hillier and Lieberman (1974), pp. 180-181). If such an A were dense, then storing and accessing its
entries  would  cause  an  optimisation  program  to  be  painfully  slow. 

The  reason  that  very  large  problems  can  be  solved  in  practice  is  that  they  are  sparse;  most  of  the 
entries  of  A  are  zero.  A  rule  of  thumb  that  is  used  in  some  applications  is  that  an  average  column  of  A 
usually has fewer than ten non-zeros in it, often fewer than five non-zeros. A matrix with the above dimensions
would  be  expected  to  have  only  about  500,000  non-zeros,  a  decrease  of  three  orders  of  magnitude  from  the 
number  of  possible  entries. 

Define the density of a sparse matrix A as the fraction of entries of A that are non-zero. Then when
$n = O(m)$, the number of possible entries of real-life matrices is $O(m^2)$ whereas their number of non-zeros
is only $O(n)$, so that their density is $O(1/m)$.
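The arithmetic behind these figures can be checked directly. The following is a minimal sketch that simply plugs in the dimensions quoted above; the value of ten non-zeros per column is the rule of thumb from the text, used here as an assumption rather than a measured quantity.

```python
# Dimensions quoted in the text (taken as exact for the sake of illustration).
m, n = 10_000, 50_000
nonzeros_per_column = 10          # assumed rule-of-thumb value

possible_entries = m * n                      # every position of a dense m x n matrix
nonzeros = n * nonzeros_per_column            # expected number of stored entries
density = nonzeros / possible_entries         # fraction of entries that are non-zero

print(possible_entries, nonzeros, density)
```

The ratio of possible entries to non-zeros is 1000, which is the "three orders of magnitude" mentioned above.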

To take advantage of sparsity it is necessary to store i and j along with $a_{ij}$, thus incurring a storage
overhead, but a relatively small one (matrix indices can often be stored in many fewer bits than numerical
values). The necessity of manipulating the indices makes some simple sparse matrix tasks quite complicated.
For example, it can be non-trivial to transpose a sparse matrix in some representations (see Gustavson
(1973)). Programs for processing sparse matrices are therefore much longer, more complicated and harder
to develop than for their dense counterparts.
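To give a feel for why even transposition takes some care, here is a small sketch of one common storage scheme (compressed rows) and a transpose done by a counting pass over the column indices. The function names `coo_to_csr` and `csr_transpose` and the 0-based indexing are choices made for this illustration; this is not claimed to be the method of the cited reference.

```python
def coo_to_csr(m, entries):
    """Build compressed-row storage from (i, j, value) triples (0-based indices)."""
    row_ptr = [0] * (m + 1)
    for i, _, _ in entries:
        row_ptr[i + 1] += 1                 # count the non-zeros in each row
    for i in range(m):
        row_ptr[i + 1] += row_ptr[i]        # prefix sums give the row start positions
    col_idx = [0] * len(entries)
    values = [0] * len(entries)
    nxt = row_ptr[:-1]                      # next free slot in each row (a copy)
    for i, j, v in entries:
        col_idx[nxt[i]] = j
        values[nxt[i]] = v
        nxt[i] += 1
    return row_ptr, col_idx, values


def csr_transpose(m, n, row_ptr, col_idx, values):
    """Transpose an m x n compressed-row matrix; the result is also compressed-row."""
    t_ptr = [0] * (n + 1)
    for j in col_idx:
        t_ptr[j + 1] += 1                   # count the non-zeros in each column
    for j in range(n):
        t_ptr[j + 1] += t_ptr[j]
    t_idx = [0] * len(col_idx)
    t_val = [0] * len(values)
    nxt = t_ptr[:-1]
    for i in range(m):                      # scatter each entry into its column's slot
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            t_idx[nxt[j]] = i
            t_val[nxt[j]] = values[p]
            nxt[j] += 1
    return t_ptr, t_idx, t_val


# A = [[1, 0, 2],
#      [0, 3, 0]]
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
row_ptr, col_idx, values = coo_to_csr(2, entries)
t_ptr, t_idx, t_val = csr_transpose(2, 3, row_ptr, col_idx, values)
```

Both routines touch each stored entry a constant number of times, so the work is proportional to the number of non-zeros plus the dimensions, not to the m × n size of the dense array.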

Nevertheless, the immense savings in execution time over comparable dense algorithms warrant taking
account of sparsity. By exploiting sparsity, much larger problems have been solved much faster than they
would have been otherwise. Indeed, the largest problems could not effectively be solved at all without sparse
matrix techniques.

Sparse methods are not faster than dense methods simply because there are many fewer numbers to
keep track of. There is another, more subtle phenomenon working in favor of sparsity. Consider solving the
system of linear equations

$$Bz = b \qquad (3.1.2)$$

where B is $m \times m$ and non-singular. The usual dense Gaussian elimination procedure takes $O(m^3)$ time.
Since the number of entries of B is $m^2$, the execution time is superlinear in the amount of data.

Now consider solving the same system when B is sparse, assuming $O(m)$ non-zeros. It has been
empirically observed that a well-implemented sparse Gaussian elimination technique takes only $O(m)$ time
(see Duff (1977), Table 3). This observation is true partly because even real problems without apparent
structure seem to have some hidden structure, though in a way that has resisted quantification. It seems
doubtful that such good results would be obtained on matrices whose non-zero entries were randomly located.
Thus in a typical sparse situation, linear equations can be solved in time linear in the amount of data.
Since solving linear equations is a ubiquitous operation in optimization, such an improvement represents a
significant speed-up compared to the dense case.

Since the time required to perform many kinds of sparse matrix operations is proportional to the number
of non-zeros in A, trying to make the given sparse linear constraints Ax = b even sparser seems to be a
natural problem to solve. That is, instead of accepting the degree of sparsity in the model as formulated,
it might help to increase the degree of sparsity (decrease the density). More succinctly, since sparsity is a
virtue, sparser should be better.

An obvious application of such an effort is that in solving (3.1.1), optimisation routines solve many
systems of linear equations like (3.1.2), where B varies over various submatrices of A (see Gill, Murray and
Wright (1981), Chapter 5). Thus, if A were sparser, the various B's would (on average) be sparser. Since
execution time depends on the number of non-zeros, the speed of optimization would increase.

A less obvious application is when the linear constraints of (3.1.1) are replaced by non-linear constraints
c(x) = 0, where c is a function from $R^n$ to $R^m$. Such non-linearly constrained problems are often sparse in
the sense that the Jacobian of c(x) is a sparse matrix, and its sparsity pattern (zero/non-zero structure) is
independent of x. An algorithm that is able to make a matrix sparser using only its sparsity pattern could
therefore be useful for non-linear problems.

These possible applications lead to considering the

Sparsity Problem (SP): Given

$$Ax = b, \qquad (3.1.3)$$

find an equivalent system

$$\bar{A}x = \bar{b} \qquad (3.1.4)$$

which is as sparse as possible, where equivalent means that the same set of x's satisfy both systems.
From simple linear algebra, (3.1.3) and (3.1.4) are equivalent if and only if $\bar{A} = TA$ and $\bar{b} = Tb$ for some
$m \times m$ non-singular matrix T. Thus, solving SP is equivalent to finding a T that minimizes the number of
non-zeros in TA. This chapter explores some ways to solve SP in theory and in practice.
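As a tiny illustration of SP, the sketch below applies a non-singular T to a 2 × 3 system and counts non-zeros before and after. The matrices A, b and T are made up for this example, and the helper names `matmul` and `nnz` are likewise illustrative.

```python
def matmul(T, A):
    """Dense product of two small matrices given as lists of rows."""
    return [[sum(T[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(T))]


def nnz(M):
    """Number of non-zero entries of M."""
    return sum(1 for row in M for x in row if x != 0)


A = [[1, 1, 1],
     [0, 1, 1]]
b = [[3],
     [2]]
T = [[1, -1],          # subtract row 2 from row 1; det T = 1, so T is non-singular
     [0, 1]]

TA = matmul(T, A)      # the sparser, equivalent constraint matrix
Tb = matmul(T, b)      # the matching right-hand side
print(nnz(A), nnz(TA))
```

Here the equivalent system has three non-zeros instead of five, and any x with Ax = b (for instance x = (1, 1, 1)) also satisfies (TA)x = Tb.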


3.1.1.  Relationship  to  Bipartite  Matching 

The methods that we shall use to solve SP involve bipartite matching theory. There is a simple
correspondence between bipartite graphs and sparsity patterns of rectangular matrices. When we write
a sparsity pattern a zero is represented by “0” or a blank, and a non-zero by “X”. Given the sparse
matrix A, define the bipartite graph B by setting the left nodes of B = { rows of A }, the right nodes of
B = { columns of A }, and the edges of B = { {i, j} | $a_{ij} \ne 0$ }. For example, if

[Display not recoverable: a two-row sparsity pattern A, whose second row is (0 X X), shown beside its bipartite graph B.]

This correspondence allows us to refer to sparsity patterns and bipartite graphs interchangeably. In this chapter
sparsity patterns will be displayed as matrices, but the language of bipartite graphs will be used to describe
them.

A subset P of the non-zeros of A such that no two elements of P lie in the same row or column is
classically known as a partial transversal (see Welsh (1976), Section 7.1). A partial transversal corresponds
to a (not necessarily maximum) matching (see, e.g., Lawler (1976), Chapter 5) in a bipartite graph (i.e.,
a subset of edges with no common vertices). For example, the circled transversal corresponds to the heavy
matching in the bipartite graph B:

[Display not recoverable: a 3 × 3 sparsity pattern A with a circled transversal, whose second row is (0 ⊗ 0), shown beside the bipartite graph B with the matching drawn heavy.]

We shall favor the term “matching” even though it is historically inappropriate for matrices.

A matching in A is called row-perfect if all rows of A are in the matching; column-perfect is defined
similarly. A matching is perfect if it is both row- and column-perfect. A maximum matching is one
with a maximum number of non-zeros. If $R \subseteq \{1, 2, \ldots, m\}$ and $C \subseteq \{1, 2, \ldots, n\}$ then $A_{RC}$ denotes the
submatrix of A indexed by rows in R and columns in C. Let $\mathcal{E}(A)$ = { non-zeros of A }, and let $M(A_{RC})$ be
the size of a maximum matching in $A_{RC}$; $M(A_{RC})$ is sometimes called the term rank of $A_{RC}$ (see Ryser
(1963), Chapter 5). An important property of maximum matchings that will be used repeatedly is stated in
the following proposition.

Proposition 3.1.1: If M is a maximum matching in A, the rows and columns of A can be permuted so
that A can be partitioned as

[Display (3.1.5) not recoverable: a block partition of A whose blocks include C and E.]   (3.1.5)

where $M \subseteq \mathcal{E}(C) \cup \mathcal{E}(E)$; $M \cap \mathcal{E}(C)$ is column-perfect for C and is row-perfect for C if and only if M is
row-perfect; and $M \cap \mathcal{E}(E)$ is row-perfect for E and is column-perfect for E if and only if M is column-perfect. □

This proposition follows from the König-Egerváry Theorem (see Ryser (1963), Theorem 5.1). As an
example of the proposition, consider

[Display not recoverable: a sparsity pattern partitioned as in (3.1.5), with a circled maximum matching.]

with the circled maximum matching. The matching is not column-perfect for E, nor for A, but it is
row-perfect for both C and A.
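The term rank M(A) defined in this section can be computed from the sparsity pattern alone. The following sketch uses the textbook augmenting-path method on the bipartite graph of a pattern; the function name `maximum_matching`, the 0/1 encoding of patterns, and the two example patterns are assumptions made for this illustration, not taken from the text.

```python
def maximum_matching(pattern):
    """Return {row: column} for a maximum matching of a 0/1 sparsity pattern."""
    m = len(pattern)
    n = len(pattern[0]) if pattern else 0
    match_col = [-1] * n          # match_col[j] = row matched to column j, or -1

    def augment(i, seen):
        # Try to match row i, re-routing previously matched rows if necessary.
        for j in range(n):
            if pattern[i][j] and j not in seen:
                seen.add(j)
                if match_col[j] == -1 or augment(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(m):
        augment(i, set())
    return {match_col[j]: j for j in range(n) if match_col[j] != -1}


# Rows 2 and 3 can only use column 2, so no matching covers all three rows.
deficient = [[1, 1, 0],
             [0, 1, 0],
             [0, 1, 0]]
# Every row can be matched (for example along the diagonal).
full = [[1, 1, 0, 0, 0],
        [0, 1, 1, 1, 1],
        [0, 0, 1, 1, 1]]

print(len(maximum_matching(deficient)), len(maximum_matching(full)))
```

The first pattern has term rank 2 and the second has term rank 3, so the second matching is row-perfect but not column-perfect.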


3.1.2.  Possible  Approaches  to  Increasing  Sparsity 

Two possible assumptions can be made in dealing with sparse matrix problems like SP. The first is that
almost all the information about A is contained in its sparsity pattern, and almost none is embodied in the
actual values of the non-zeros. This assumption is used by the graph models of the location of fill-in during
sparse Gaussian elimination which occur in theorems about the complexity of minimizing fill-in (see Problem
[GT46] in Garey and Johnson (1979)).

The complementary assumption is that the non-zero values have a structure that can be exploited in
solving SP. This assumption would lead to an algorithm that would try to discover numerical relations among
the non-zeros in an effort to increase sparsity. An example of this assumption as used for a different problem
is the work of Bixby and Cunningham (1983) on solving linear programs faster by finding large embedded
networks.

We shall use the first approach in this chapter. Indeed, in the application to non-linear constraints
discussed above, no other approach is possible since there is a fixed sparsity pattern with changing numeric
entries.


3.1.3.  Overview  of  this  Chapter

Section 3.2 opens with a discussion of why SP is difficult without making a generality assumption. A
rigorous definition of the assumption used in the rest of the chapter is then given, and is applied to derive
polynomial algorithms to solve SP. With the assumption, a polynomial algorithm is constructed that solves
an important subproblem of SP, the One Row Sparsity Problem. This algorithm is at the heart of all the
other algorithms.

In Section 3.3, the One Row Algorithm of Section 3.2 is used to derive two polynomial algorithms
for solving SP. One of the algorithms is important for theoretical reasons, and the other can be modified
into a practically implementable algorithm. Some theoretical consequences of these algorithms for
Dulmage-Mendelsohn decomposition and for evaluating the complexity of matroid algorithms are also derived in
Section 3.3.

These algorithms are developed into a more practical version in Section 3.4. This section also considers
what happens when the practical algorithm is used to process real problems that do not satisfy the generality
assumption, and it is shown that the performance of the practical algorithms can be no worse than the
performance of the theoretical algorithms. Finally, it reviews some implementation techniques that can
greatly speed up the algorithm.

In Section 3.5 some computational results are discussed which were obtained from a test implementation
of the algorithm based on the considerations of Section 3.4. While the test results are from a small sample of
problems and are therefore preliminary, they are still encouraging for the eventual practical implementation
of the algorithm.

Finally, the current status of this research is reviewed in Section 3.6, with suggestions about some
directions for future development. Particular attention is paid to possible enhancements of the algorithm
that might increase its applicability.


3.2.  The  Matching  Property  and  the  One  Row  Algorithm

In  this  section  and  in  Section  3.3  it  is  assumed  that  the  matrix  A  in  (3.1.3)  has  full  rank.  The  effect  of 
removing  this  assumption  is  considered  in  Section  3.4. 


3.2.1.  The  Matching  Property 


To illustrate the pitfalls in trying to solve SP solely from the sparsity pattern of A, consider the following
sparsity pattern, in which the submatrix in rows 2 and 3 and columns 2 and 3 is boxed:

$$\begin{pmatrix} X & X & 0 & 0 & 0 \\ 0 & X & X & X & X \\ 0 & 0 & X & X & X \end{pmatrix} \qquad (3.2.1)$$

In order to make the first row sparser, a multiple of the second row could be added to the first to zero out
the 1,2 position. However, it appears that the 1,3, 1,4, and 1,5 entries fill in (change from a zero into a
non-zero) because of this operation. To mitigate the fill-in, the multiple of row 3 that turns entry 1,3 back
into a zero could be added to row 1. The combination of these two row operations gives the same effect as
if the boxed submatrix of (3.2.1), which is certainly non-singular, was used to turn the 1,2 entry into a zero
while keeping the 1,3 entry zero. The expected result is

$$\begin{pmatrix} X & 0 & 0 & X & X \\ 0 & X & X & X & X \\ 0 & 0 & X & X & X \end{pmatrix},$$

which is not sparser.

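Consider now the following two matrices with sparsity pattern (3.2.1), transformed as above.

$$TA^1 = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix},$$

$$TA^2 = \begin{pmatrix} 1 & -1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 2 & 3 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & -1 & -2 \\ 0 & 1 & 1 & 2 & 3 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix}.$$

In the second case, the sparsity decreased as expected, but in the first case, the sparsity increased. The
reason for the unexpected behavior of $A^1$ is that the boxed submatrix (in rows 2 and 3 of $A^1$, where the two
rows have equal entries) has rank only 1, not 2, which caused
cancellation to occur in columns 4 and 5. When a gratuitous zero appears outside the columns we were trying
to affect (in the example we were trying to affect columns 2 and 3, and for $A^1$ gratuitous zeros appeared in
columns 4 and 5), we say that unexpected cancellation has occurred.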
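The two cases can be replayed numerically. The sketch below assumes that the two matrices are the ones with pattern (3.2.1) whose non-zeros are all ones ($A^1$) and whose second row is (0, 1, 1, 2, 3) ($A^2$); this reading of the display, and the helper name `matmul`, are assumptions of the illustration.

```python
def matmul(T, A):
    """Dense product of two small matrices given as lists of rows."""
    return [[sum(T[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(T))]


T = [[1, -1, 1],       # row 1 := row 1 - row 2 + row 3
     [0, 1, 0],
     [0, 0, 1]]
A1 = [[1, 1, 0, 0, 0],
      [0, 1, 1, 1, 1],
      [0, 0, 1, 1, 1]]
A2 = [[1, 1, 0, 0, 0],
      [0, 1, 1, 2, 3],
      [0, 0, 1, 1, 1]]

TA1 = matmul(T, A1)
TA2 = matmul(T, A2)
print(TA1[0])   # first row after the transformation of A1
print(TA2[0])   # first row after the transformation of A2
```

For $A^2$ the new first row fills in columns 4 and 5 as the pattern predicts, whereas for $A^1$ the contributions of rows 2 and 3 cancel there, which is exactly the unexpected cancellation described above.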

Predicting unexpected cancellation can be extremely difficult. The next theorem shows that allowing
unexpected cancellation makes SP essentially intractable.

Theorem 3.2.1: It is NP-Hard to solve SP.

Proof: This theorem and its proof are due to Stockmeyer (1982). It is claimed only that SP is NP-Hard, rather
than NP-Complete, because it is difficult to show that SP is in NP. However, Stockmeyer has conjectured
that SP ∈ NP (see Garey and Johnson (1979) for the definition of NP).

The problem that we shall reduce to SP is

Simple Max Cut: Given an undirected graph $\mathcal{G} = (V, E)$, partition the nodes of $\mathcal{G}$ into P and V \ P
so as to maximise $|\{\, \{i,j\} \in E \mid i \in P,\ j \in V \setminus P \,\}|$.

A proof that Simple Max Cut is NP-Complete is referenced in Garey and Johnson (1979), Problem
[ND16].

Let n = |V|, m = |E|, let $A(\mathcal{G})$ be the usual (0,1) vertex-edge incidence matrix of $\mathcal{G}$, and let $A_i$ be the
$n \times 2m$ matrix which is all zero except for row i, half of whose components are +1 and half −1. Let e be
the 2m-vector of all ones and let f be the (2m(n+1)+1)-vector of ones. Suppose that SP could be solved
for the matrix

$$B(\mathcal{G}) = \begin{pmatrix} 0 & e^T & e^T & \cdots & e^T & f^T \\ A(\mathcal{G}) & A_1 & A_2 & \cdots & A_n & 0 \end{pmatrix}.$$

Define $T^*$ to be a matrix so that $T^* B(\mathcal{G})$ is an optimal solution to SP. Since $T^*$ is non-singular, it must
have a perfect matching, which can be assumed without loss of generality to be on its diagonal. Also, since $T^*$
stays optimal after row scaling it can be assumed that $T^*$ has unit diagonal. Because of the size of f, it
is never worthwhile to use row 1 when reducing any other row, and hence the first column of $T^*$ must be
$(1, 0, \ldots, 0)^T$. Thus no choice for the remaining entries of the first row of $T^*$ can cause it to be singular.

Because of the column size of the $A_i$, and since all entries are ±1, it is helpful to use every other row in
reducing the first row. Hence, the first row of $T^*$ must be $(1, \epsilon_1, \epsilon_2, \ldots, \epsilon_n)$, where $\epsilon_i = \pm 1$ for all i ∈ V.
Let $P = \{\, i \mid \epsilon_i = +1 \,\}$. Then the number of non-zeros in the first row of the reduced matrix of $B(\mathcal{G})$ is
clearly

$$(2m(n+1) + 1) + mn + \big(m - |\{\, \{i,j\} \in E \mid i \in P,\ j \in V \setminus P \,\}|\big). \qquad (3.2.2)$$

But since (3.2.2) is minimized by the optimal $T^*$, P also solves the Simple Max Cut Problem for $\mathcal{G}$. □

This proof works because of the great opportunity for unexpected cancellation in a (0,1)-matrix. The
proof shows in particular that any numerical approach to SP must be heuristic, rather than aiming for
optimality. In order to “combinatorialize” SP and bring it back into the class of polynomial algorithms, an
assumption about the non-zeros of A is needed that effectively rules out unexpected cancellation.

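As motivation for the assumption that will be used, consider the following chain of implications about
an $n \times n$ matrix B:

$$\begin{aligned} \operatorname{rank} B = n &\iff \det B \ne 0 \iff \sum_{\sigma} \operatorname{sgn}\sigma \; b_{1\sigma_1} \cdots b_{n\sigma_n} \ne 0 \\ &\implies \text{there is a permutation } \sigma \text{ such that } b_{1\sigma_1}, \ldots, b_{n\sigma_n} \ne 0 \qquad (3.2.3) \\ &\iff B \text{ has a perfect matching.} \end{aligned}$$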

As the example of $A^1$ showed, unexpected cancellation is caused by submatrices whose rank is less than
that suggested by their sparsity pattern. The notion of what rank “should” be turns out to be that if a
submatrix can be permuted so that it has a non-zero diagonal, then it should have full rank. Since having
a non-zero diagonal is the condition in (3.2.3), rank is what it should be if the implication in (3.2.3) goes
backwards as well. That is, if B has a perfect matching, then it should have full rank. Such “generality”,
“non-degeneracy”, “general position” or “independence” is often assumed in sparse matrix studies. A formal
statement of this property is:

Matching Property (MP): A has (MP) if rank $A_{RC}$ = $M(A_{RC})$ for all row subsets R and column
subsets C.

In other terminology, (MP) states that term rank and numerical rank are the same for every submatrix
of A. In the example above, $A^2$ has (MP) but $A^1$ does not (it fails precisely on the boxed submatrix).
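For very small matrices (MP) can be checked by brute force, comparing exact rank with term rank on every submatrix. The sketch below does this with rational arithmetic; the helper names (`rank`, `term_rank`, `mp_violations`) and the concrete entries of the two example matrices are assumptions of this illustration, and the enumeration is exponential, so it is only meant for toy cases.

```python
from fractions import Fraction
from itertools import combinations, permutations


def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


def term_rank(M):
    """Size of a maximum matching, found by exhaustive search."""
    m, n = len(M), len(M[0])
    for k in range(min(m, n), 0, -1):
        for rows in combinations(range(m), k):
            for cols in permutations(range(n), k):
                if all(M[i][j] != 0 for i, j in zip(rows, cols)):
                    return k
    return 0


def mp_violations(A):
    """All (rows, cols) index pairs (0-based) whose submatrix has rank != term rank."""
    m, n = len(A), len(A[0])
    bad = []
    for k in range(1, m + 1):
        for rows in combinations(range(m), k):
            for l in range(1, n + 1):
                for cols in combinations(range(n), l):
                    sub = [[A[i][j] for j in cols] for i in rows]
                    if rank(sub) != term_rank(sub):
                        bad.append((rows, cols))
    return bad


A1 = [[1, 1, 0, 0, 0], [0, 1, 1, 1, 1], [0, 0, 1, 1, 1]]
A2 = [[1, 1, 0, 0, 0], [0, 1, 1, 2, 3], [0, 0, 1, 1, 1]]
print(len(mp_violations(A1)), len(mp_violations(A2)))
```

With these entries the second matrix has no violating submatrix, while the first fails, among others, on rows 2 and 3 restricted to columns 4 and 5 (indices (1, 2) and (3, 4) in the 0-based output).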

It is interesting to consider how (MP) relates to other possible assumptions. A very strong assumption
would be that the non-zeros of A act essentially like independent algebraic indeterminates. That is, any
entry of the result of any linear algebraic operation on A can be expressed as a multivariate polynomial in
formal variables $x_{ij}$, one for each non-zero of A. An entry of the result is considered to be zero only if its
multivariate polynomial is identically zero. Algebraic independence implies (and therefore is stronger than)
(MP).

A  similar  assumption  would  be  that  each  non-zero  of  A  is  perturbed  from  its  original  value  by  an 
independent infinitesimal, similar to the construction often used to resolve degeneracy. This perturbation
assumption  also  implies  (MP).  Thus  (MP)  is  weak  relative  to  other  such  assumptions. 

Although (MP) is not particularly stringent, most real-life matrices do not satisfy (MP). The reason
is that real matrices have many entries which are small integers, thereby producing submatrices which
violate (MP). We shall nevertheless construct an algorithm to solve SP assuming (MP) because Theorem
3.2.1 implies that there is little hope of solving SP without such an assumption. One reasonable heuristic
approach to SP is to solve it with (MP), and then to apply the resulting algorithm to matrices which do not
necessarily satisfy (MP). Though it is an apparent contradiction, such an (as yet hypothetical) algorithm
would be an “optimal heuristic” for SP. It would be optimal for those matrices that satisfy (MP), and it
would be heuristic for the others.

3.2.2.  The  One  Row  Algorithm 

For the reason discussed above, in this section and in Section 3.3 A is assumed to satisfy (MP). In order
to show that (MP) implies that no unexpected cancellation can occur, some preliminary discussion is needed.

As noted in Section 3.1, solving SP involves constructing a non-singular T so that TA is as sparse as
possible. By (3.2.3) T must have a perfect matching, and by permuting indices, it can be assumed that
every entry on the diagonal of T is non-zero. Scaling the rows of T does not affect the sparsity of TA, and
hence it can be further assumed that $t_{ii} = 1$, i = 1, 2, ..., m. With this scaling, row i of T specifies an
elementary row operation to be performed on row i of A, namely add the other rows of A to the ith row
with the multipliers specified by the entries in row i of T. Since TA is supposed to be sparser than A, its
ith row should also be sparser, which leads to consideration of:

The One Row Sparsity Problem for Row i (ORSP_i): Find { $\lambda_k$, k ≠ i } so that

$$\bar{A}_{i\bullet} = A_{i\bullet} + \sum_{k \ne i} \lambda_k A_{k\bullet} \qquad (3.2.4)$$

is as sparse as possible.

Once ORSP_i is solved, the hope is that the resulting λ for row i of A can be packed into row i of T, which
can then be used to solve SP. It is not clear that the resultant T is non-singular as required; nevertheless,
in the rest of this section we shall concentrate on solving ORSP_i.

A set of multipliers { $\lambda_k$ | k > 1 } for (3.2.4) when i = 1 defines the following index subsets:

$U = \{\, k > 1 \mid \lambda_k \ne 0 \,\}$,
$H = \{\, j \mid \bar{a}_{1j} = 0 \text{ and } a_{1j} \ne 0 \,\}$,
$S = \{\, j \mid \bar{a}_{1j} = 0 \text{ and } a_{1j} = 0 \text{ and } a_{kj} \ne 0 \text{ for some } k \in U \,\}$,
$G = H \cup S$,
$F = \{\, j \mid \bar{a}_{1j} \ne 0 \text{ and } a_{1j} = 0 \,\}$,
$P = F \cup S = \{\, j \mid a_{1j} = 0 \text{ and } a_{kj} \ne 0 \text{ for some } k \in U \,\}$, and
$Z = \{\, j \mid a_{1j} = 0 \,\}$.

That is, U is the set of used rows; H is the set of hit columns, where a non-zero was changed to a zero; S
is the set of saved columns, where a zero that might have been expected to be filled-in (since $a_{kj} \ne 0$) was
not filled-in; G is the set of good columns, where the entry was actively manipulated for the better; F is
the set of filled-in columns; P is the set of potential fill-in columns; and Z is the set of zero columns. As
an example of these definitions, the index sets are indicated for the following sparsity pattern:

[Display not recoverable: a sparsity pattern annotated with the index sets; its old row 1 reads X X X X X X 0 0 0 0 0 0.]

Note that the net decrease in non-zeros in row 1 is |H| − |F|, so that solving ORSP_1 is equivalent to solving
$\max_\lambda\, |H| - |F|$.
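The index sets are easy to compute once the multipliers are fixed. The sketch below follows the definitions literally, with row index 0 standing for row 1 and 0-based column indices; the function name `index_sets`, the dictionary encoding of the multipliers, and the example matrix and multipliers are assumptions made for this illustration.

```python
def index_sets(A, lam):
    """Index sets for row 0 of A under multipliers lam = {row k: lambda_k}, k != 0."""
    n = len(A[0])
    # New first row: the old row plus the weighted sum of the other rows.
    new = [A[0][j] + sum(l * A[k][j] for k, l in lam.items()) for j in range(n)]
    U = {k for k, l in lam.items() if l != 0}
    H = {j for j in range(n) if new[j] == 0 and A[0][j] != 0}
    S = {j for j in range(n)
         if new[j] == 0 and A[0][j] == 0 and any(A[k][j] != 0 for k in U)}
    F = {j for j in range(n) if new[j] != 0 and A[0][j] == 0}
    Z = {j for j in range(n) if A[0][j] == 0}
    return {"U": U, "H": H, "S": S, "G": H | S, "F": F, "P": F | S, "Z": Z,
            "new_row": new}


A = [[1, 1, 0, 0, 0],
     [0, 1, 1, 2, 3],
     [0, 0, 1, 1, 1]]
sets = index_sets(A, {1: -1, 2: 1})      # new row = row 1 - row 2 + row 3
gain = len(sets["H"]) - len(sets["F"])   # net decrease in non-zeros of the row
print(sets, gain)
```

For this choice one column is hit, one is saved and two are filled in, so the net decrease is −1: the row gets denser, as the pattern alone predicts.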

Now (MP) can be put to work. The next theorem states the intuitive result that if k columns of row 1 are
affected for the good, at least k rows must be used. This theorem is the key fact that rules out unexpected
cancellation.

Theorem 3.2.2: For any λ, |G| ≤ |U|.

Proof: By contradiction. Assume that |G| > |U|.

If $M(A_{UG}) < |U|$ then by Proposition 3.1.1 A must look like

[Display (3.2.5) not recoverable: $A_{UG}$ partitioned as in Proposition 3.1.1, marking a row subset R and a column subset C.]   (3.2.5)

where the row subset R and column subset C are defined as shown. (When we say that a matrix “looks
like” a picture such as the one above, we mean that its rows and columns can be permuted so that it has
the form shown.) Otherwise ($M(A_{UG}) = |U|$), let R = U and C = G. In either case, A has a submatrix
that looks like

[Display (3.2.6) not recoverable: row 1 together with the rows R, over the columns C, with the column subset N marked.]   (3.2.6)

where R ⊆ U, C ⊆ G and |R| < |C|. Also, the rows of R provide the only non-zero contribution to the
new row 1 in the columns of C, and $M(A_{RC}) = |R|$. The set of columns induced by a maximum matching
in $A_{RC}$ is denoted by N as shown in (3.2.6).

Note that $A_{RN}$ is square and $M(A_{RN}) = |R|$; thus, by (MP), $A_{RN}$ is non-singular. Now
$A_{1N} + \lambda_R^T A_{RN} = \bar{A}_{1N} = 0$ since N ⊆ G, so that $\lambda_R$ must be the unique solution of $\lambda_R^T A_{RN} = -A_{1N}$. Define
$\bar{R} = R \cup \{1\}$. For k ∈ R let $\bar{R}_k = \bar{R} \setminus \{k\}$, and for j ∈ C \ N let $N_j = N \cup \{j\}$. Then Cramer's Rule
implies that

$$\lambda_k = (-1)^k \det A_{\bar{R}_k N} \quad \text{for } k \in R. \qquad (3.2.7)$$

For j ∈ C \ N (which is non-empty), using (3.2.7) and the expansion of the determinant by cofactors in
reverse, it can be shown that

$$\bar{a}_{1j} = \det A_{\bar{R} N_j}. \qquad (3.2.8)$$

Note that j ∈ C ⊆ G implies that $\bar{a}_{1j} = 0$. If $a_{1j} \ne 0$, then the perfect matching in $A_{RN}$ can be
trivially extended by $a_{1j}$ to a perfect matching in $A_{\bar{R} N_j}$. But now (MP) implies by (3.2.8) that $\bar{a}_{1j} \ne 0$, a
contradiction.

Suppose instead that $a_{1j} = 0$. Then j ∈ S, and therefore there must be some k ∈ R such that $a_{kj} \ne 0$.
Since k ∈ R ⊆ U, it follows that $\lambda_k \ne 0$. Thus, by (3.2.7), $A_{\bar{R}_k N}$ must have a perfect matching. But adding
$a_{kj}$ to this matching gives a perfect matching of $A_{\bar{R} N_j}$. Once again, (MP) and (3.2.8) imply that $\bar{a}_{1j} \ne 0$, a
contradiction. □

Corollary 3.2.3: $M(A_{UG}) = |G|$; thus, by (MP), rank $A_{UG}$ = |G|.

Proof: If $M(A_{UG}) < |G|$, then by Proposition 3.1.1 A must look like (3.2.5), which would again lead to a
submatrix of A like (3.2.6), which cannot exist by Theorem 3.2.2. □

Theorem  3.2.2  and  its  corollary  prove  that  there  can  be  no  unexpected  cancellation  when  (MP)  is 
satisfied.  Indeed,  (3.2.6)  is  precisely  a  picture  of  what  is  meant  by  unexpected  cancellation. 

It is possible to have |U| > |G|, but it causes additional work with no further increase in sparsity. A
subset R ⊆ U could be selected that perfectly matches into G (which is possible by Corollary 3.2.3), and
the non-zero part of λ computed as the solution of

$$\lambda_R^T A_{RG} = -A_{1G}, \qquad (3.2.9)$$

which gives the same result with less work. Conversely, suppose that a square, non-singular submatrix $A_{RG}$
is used to zero out $A_{1G}$ via (3.2.9). Then Theorem 3.2.2 implies that no columns outside G are affected
for the good (either hit or saved).

Hence it can be assumed without loss of generality that |U| = |G|. Since λ is now uniquely determined
by an equation like (3.2.9), U and G can be thought of as determining λ rather than vice-versa. Thus ORSP_i
has been reduced to the more combinatorial problem of finding optimal U and G.

Since finding λ requires solving a system of linear equations of size |U|, if there are several different
optimal U the smallest cardinality one is preferred in order to minimise work. (Actually, as was mentioned
in Section 3.1, since work is proportional to the number of non-zeros, the number of non-zeros in $A_{UG}$ should
be minimised, but |U| makes an acceptable substitute.) Formally stated, we would ideally like to solve:

The Strong ORSP_i: Find an optimal solution to ORSP_i that minimizes |U|.

A little reflection over the definitions of the index sets F, S, H and P reveals that they can be easily
determined combinatorially from U and G. In fact, P depends only on U (in words, P is the set of
not-identically-zero columns j of $A_{U\bullet}$ such that $a_{1j} = 0$), and so it will be denoted by P(U). The set G merely
determines how P(U) is split up into F and S. These observations lead to an easy proof that the gain of
zeros, |H| − |F|, depends only on U, not on G:

Theorem 3.2.4: Let $G_1$ and $G_2$ be two sets of columns that perfectly match into U, and denote the set
of hit columns corresponding to $G_i$ by $H_i$, i = 1, 2, etc. Then

$$|H_1| - |F_1| = |H_2| - |F_2|.$$

Proof: It is easy to see that $|H_i| = |U| - |S_i|$ and $|F_i| = |P(U)| - |S_i|$, so that $|H_i| - |F_i| = |U| - |P(U)|$,
i = 1, 2. □

The proof shows that solving ORSP_i is equivalent to solving

$$\max_{U} \; |U| - |P(U)|. \qquad (3.2.10)$$

Since A has full rank, every U must perfectly match into some column subset G; thus, the maximization in
(3.2.10) is over all U. Define R = {2, 3, ..., m} and $\bar{U} = R \setminus U$, and call $A_{RZ}$ the first zero-section of
A. By the definition of P(U), $A_{RZ}$ must look like

$$
\begin{array}{c|cc}
 & P(U) & Z \setminus P(U) \\ \hline
U & * & 0 \\
\bar{U} & * & *
\end{array}
$$

Thus every non-zero in $A_{RZ}$ is contained in either a column of P(U) or a row of $\bar{U}$. Consider R and Z to
be disjoint sets whose union is the set of lines of $A_{RZ}$. Then, since every non-zero of $A_{RZ}$ is in a line of
$P(U) \cup \bar{U}$, $A_{RZ}$ is said to be covered by the lines in $P(U) \cup \bar{U}$. Conversely, suppose that L is a cover of the
first zero-section by lines with a minimal number of columns. Then define U = R \ {rows in L}. Since L
has a minimal number of columns, the columns in L must be P(U) (otherwise a column could be dropped
from L and it would still cover $A_{RZ}$, contradicting minimality). Hence, covers by lines with minimal columns
correspond to subsets $U \subseteq R$.

But now ORSP_i has been reduced to

$$\max_{U} \; |U| - |P(U)| \;=\; (m - 1) - \min_{U} \bigl(|P(U)| + |\bar{U}|\bigr) \;=\; (m - 1) - \min_{L \text{ covers } A_{RZ}} |L|, \qquad (3.2.11)$$

so that finding a minimum cardinality cover of the first zero-section also solves ORSP_i.

The classic combinatorial duality theorem of König and Egerváry (Ryser (1963), Theorem 5.1) shows
that a minimum cover in (3.2.11) can be computed via a maximum matching in $A_{RZ}$:

Theorem 3.2.5: $M(A_{RZ}) = \min\{\, |L| : L \text{ covers } A_{RZ} \,\}$, and maximum matchings and minimum covers
are dual combinatorial objects (which means that any algorithm that computes one must also compute the
other, at least implicitly). □

Finding maximum matchings in bipartite graphs is a well-studied problem, for which many polynomial
algorithms have been developed. Most such algorithms find a maximum matching by labelling. If labels
are started at the row side of the bipartite graph, then at the final iteration of a labelling algorithm the
labelled rows and columns are those reachable by a partial augmenting path from an unmatched row. The
dual minimum cover can be extracted from the final labels as {labelled columns} ∪ {unlabelled rows} (see
Lawler (1976), p. 193). Since U is complementary to the rows in a minimum cover, U is the set of labelled
rows at the end of a maximum matching algorithm (and P(U) is the set of labelled columns). In matrix
terms, if the first zero-section is the matrix in (3.1.5), then U is the set of rows of C. This gives a polynomial
algorithm for solving ORSP_i.
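The labelling description above can be illustrated with a short sketch. It is not the implementation used in this work: it assumes a simple augmenting-path matching routine, and the names `max_matching`, `final_labels` and the small pattern `zero_section` are hypothetical, introduced only for this example. The zero-section is given as one set of non-zero column indices per row.

```python
def max_matching(pattern):
    """Maximum bipartite matching by augmenting paths.

    pattern[r] is the set of columns holding a non-zero in row r.
    Returns (col_of, row_of): row -> matched column, column -> matched row.
    """
    row_of, col_of = {}, {}

    def augment(r, seen):
        # Try to match row r, re-matching earlier rows if necessary.
        for c in pattern[r]:
            if c in seen:
                continue
            seen.add(c)
            if c not in row_of or augment(row_of[c], seen):
                row_of[c] = r
                col_of[r] = c
                return True
        return False

    for r in range(len(pattern)):
        augment(r, set())
    return col_of, row_of


def final_labels(pattern, col_of, row_of):
    """Rows and columns reachable by alternating paths from unmatched rows."""
    rows = {r for r in range(len(pattern)) if r not in col_of}
    cols = set()
    stack = list(rows)
    while stack:
        r = stack.pop()
        for c in pattern[r]:
            if c not in cols:
                cols.add(c)
                # In a maximum matching every labelled column is matched.
                nxt = row_of[c]
                if nxt not in rows:
                    rows.add(nxt)
                    stack.append(nxt)
    return rows, cols


# Hypothetical 4 x 3 zero-section: rows 0 and 1 compete for column 0.
zero_section = [{0}, {0}, {0, 1}, {2}]
col_of, row_of = max_matching(zero_section)
U, P_U = final_labels(zero_section, col_of, row_of)
cover_rows = set(range(len(zero_section))) - U   # unlabelled rows
cover_size = len(P_U) + len(cover_rows)          # |P(U)| + |U-bar|
```

In this small pattern the labelled rows are U = {0, 1}, the labelled columns are P(U) = {0}, and the cover {column 0, rows 2 and 3} has the same size as the maximum matching, as Theorem 3.2.5 requires.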

There is still more juice that can be squeezed out of the maximum matching. Recall that it is preferable
to solve the Strong ORSP_i, i.e., find the minimum cardinality U among all optimum U. Note that it is
easy to turn the maximum matching problem on $A_{RZ}$ into a network flow problem (see Section 2.5). In the
network flow context, the dual object to a maximum flow is a minimum cut. At an optimum of a network
flow problem, a minimum cut can be extracted as the set of vertices reachable from the source by a partial
augmenting path (as above for minimum covers). Let the minimum cut computed in this way be called
the standard minimum cut. Since labels for maximum flow and maximum matching are isomorphic, the
standard minimum cut (standard minimum cover in matrix terms) must be equal to $U \cup P(U)$.

Theorem 3.2.6: In any network the standard minimum cut is a subset of every minimum cut. Thus the
standard minimum cut is the same for every optimal flow. Hence its definition is independent of the optimal
flow used to compute it, and it has minimum cardinality among all minimum cuts (see Ford and Fulkerson
(1962), p. 13). □

Since the standard minimum cut and the standard minimum cover have complementary sets of rows,
U is the set of rows in the standard minimum cut. Such a U is already known to solve ORSP_i by (3.2.11).
Theorem 3.2.6 states that the standard U is unique, solves the Strong ORSP_i, and can be found at no
additional expense. Returning to the partition (3.1.5), if $A_{RZ}$ has several such partitions in which C has
different sizes, then by Theorem 3.2.6, maximum matching automatically generates the partition in which C
has the fewest rows possible. In addition to cutting down on the amount of work needed to solve equations for
X in practice, this theorem has important theoretical implications that are explored in subsequent sections.

Note that there is a strong asymmetry in choosing which side of the network is the source. The above
discussion applies when the maximum matching or network flow is started from the row side. If it is started
from the column side, then the largest U that is optimal for ORSP_i is generated instead of the smallest.

We shall now put all the pieces together into an algorithm for solving (the Strong) ORSP_i.

The One Row Algorithm for Row i (ORA_i):

0. Perform a maximum matching in the ith zero-section (starting from the row side).

1. Construct $U_i$ as the set of labelled rows at the end of the matching.

2. Find a column subset $G_i$ that perfectly matches into $U_i$ (then $A_{U_i G_i}$ is non-singular by (MP)).

3. Solve $X_{U_i} A_{U_i G_i} = -A_{i G_i}$.

4. Set $\hat{A}_{i\cdot}$ to $A_{i\cdot} + X_{U_i} A_{U_i \cdot}$.

Note that Step 2 allows considerable freedom in choosing $G_i$, a point whose importance is shown later.


3.3.  Theoretical  Algorithms  for  SP 

We would like to combine the local solutions to ORSP_i, i = 1, 2, ..., m, into a global solution for SP.
However, this process requires care. For example, consider the matrix

[Display (3.3.1) lost in extraction: a completely dense 2 × 2 matrix A.]  (3.3.1)

Denote the (unique by Theorem 3.2.6) optimal U for ORSP_i by $U_i^*$. Then for (3.3.1), $U_1^* = \{2\}$ and
$U_2^* = \{1\}$. (When the ith zero-section has no columns as in this case, the bipartite graph of the zero-section
has only left (row) vertices and no edges. Hence a matching algorithm terminates after it has initially
labelled all of the rows. Since $U_i^*$ is the set of labelled rows at optimality, when row i is completely dense
$U_i^* = \{1, 2, \ldots, m\} \setminus \{i\}$.) At row 1, $G_1$ can be chosen to be {1}, giving $\hat{A}_{1\cdot} = (0 \;\; {-1})$. In row 2, $G_2$
can also be chosen to be {1}, giving $\hat{A}_{2\cdot} = (0 \;\; 1)$. With these choices the final result is

$$\hat{A} = \begin{pmatrix} 0 & -1 \\ 0 & 1 \end{pmatrix},$$

which is singular. This illustrates that the $G_i$ cannot be chosen arbitrarily in Step 2 of ORA_i.
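Steps 3 and 4 of ORA_i, and the singular outcome just described, can be checked with a small exact-arithmetic sketch. The 2 × 2 matrix below is an assumption: display (3.3.1) is not legible in this copy, and this matrix was picked only because it is completely dense and reproduces the two rows quoted in the text. The helper names `solve` and `ora_steps_3_4` are hypothetical, and indices are 0-based.

```python
from fractions import Fraction


def solve(mat, rhs):
    """Solve mat * x = rhs exactly by Gauss-Jordan elimination.

    mat must be square and non-singular (otherwise next() raises StopIteration).
    """
    n = len(mat)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(mat, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def ora_steps_3_4(A, i, U, G):
    """Return the new row i, given the row set U and column set G."""
    # Step 3: X A_UG = -A_iG, solved as (A_UG)^T X^T = -(A_iG)^T.
    lhs = [[A[u][g] for u in U] for g in G]
    rhs = [-A[i][g] for g in G]
    X = solve(lhs, rhs)
    # Step 4: new row i = A_i. + X A_U.
    return [A[i][j] + sum(x * A[u][j] for x, u in zip(X, U)) for j in range(len(A[0]))]


# Assumed stand-in for (3.3.1); both rows are completely dense.
A = [[1, 1],
     [1, 2]]
row1 = ora_steps_3_4(A, i=0, U=[1], G=[0])   # U_1 = {2}, G_1 = {1} in 1-based terms
row2 = ora_steps_3_4(A, i=1, U=[0], G=[0])   # U_2 = {1}, G_2 = {1} in 1-based terms
det = row1[0] * row2[1] - row1[1] * row2[0]  # determinant of the 2 x 2 result
```

Both rows lose their first entry, so the transformed matrix has a zero first column and its determinant vanishes, which is the difficulty the text describes.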

3.3.1.  The  Parallel  Algorithm 

There is a fairly natural way to choose $G_i$ in Step 2 that avoids the potential difficulty above, and that
also saves time. Let M be a fixed maximum matching for A (M must be row-perfect since A has full row
rank). Once $U_i^*$ is determined in Step 1 of ORA_i, $G_i$ can be chosen to be the set of columns that $U_i^*$ matches
to under M. Note that $\hat{A}$, the output of running ORA_i for i = 1, 2, ..., m, equals $TA$, where the ith row
of T has one in the diagonal position, and is the X from running ORA_i elsewhere. Define $T^*$ to be the T
obtained by choosing the $G_i$ relative to M. Thus when $i \ne j$, $t^*_{ij} \ne 0$ if and only if $j \in U_i^*$.

Suppose that it could be shown that $T^*$ is non-singular, so that it is a valid candidate for a T to solve
SP. Then $T^*$ must be an optimal solution to SP. Consider any other candidate T, which can be assumed
without loss of generality to have a diagonal of all ones (as in Section 3.2). If T increased the sparsity of A
more than $T^*$, then at least one row of $TA$ would have to be sparser than the corresponding row of $T^*A$.
But the optimality of ORA_i implies that every row of $T^*A$ is individually as sparse as possible. Thus $T^*$ is
optimal if it is non-singular.

The determination of the non-singularity of $T^*$ depends on the implications of the uniqueness of the $U_i^*$
for the structure of $T^*$. Define a directed graph $D^*$ with nodes {1, 2, ..., m} and arcs $\{(k, i) \mid k \in U_i^*\}$, so
that $D^*$ captures the sparsity pattern of $T^*$. Such a directed graph can be similarly defined for any square
sparse matrix T with non-zero diagonal.

In any directed graph D, node j is defined to be reachable from node i if there is a directed path from
i to j. The relation

$$i \sim j \iff i \text{ is reachable from } j \text{ and } j \text{ is reachable from } i$$

is an equivalence relation that partitions the nodes of D. A class of this partition is called a strong
component, for every node in a strong component C is reachable from every other node in C. If a node
l in strong component $C_j$ is reachable from a node k in strong component $C_i$, then $C_i$ precedes $C_j$, which
is denoted $C_i \prec C_j$. It can be shown that the $\prec$ relation is a partial order on the strong components.
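The definitions of reachability, strong components and the precedence order can be made concrete with a small sketch. It is only an illustration: it uses plain depth-first reachability rather than any particular strong-component algorithm, and the function names and the four-node graph are hypothetical. Each node is treated as reachable from itself.

```python
def reachable(arcs, start):
    """Set of nodes reachable from start (start itself included)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in arcs.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strong_components(nodes, arcs):
    """Strong components, listed in an order consistent with precedence."""
    reach = {u: reachable(arcs, u) for u in nodes}
    comps, assigned = [], set()
    for u in nodes:
        if u in assigned:
            continue
        # v is in u's class iff each is reachable from the other.
        comp = {v for v in reach[u] if u in reach[v]}
        comps.append(comp)
        assigned |= comp
    # A component that precedes another reaches strictly more nodes,
    # so sorting by decreasing reach size gives a consistent linear order.
    comps.sort(key=lambda c: -len(reach[next(iter(c))]))
    return comps, reach


# Hypothetical graph: {3, 4} and {1, 2} are strong components, and 4 -> 1.
nodes = [1, 2, 3, 4]
arcs = {1: [2], 2: [1], 3: [4], 4: [3, 1]}
comps, reach = strong_components(nodes, arcs)
```

Here the component {3, 4} precedes {1, 2}, so it is listed first; ordering the rows and columns of a matrix with this pattern in the same way is what produces the block triangular form discussed next.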

To reflect the strong component partition back into matrix terms, order the strong components of D
in a linear order consistent with $\prec$, and order nodes arbitrarily within components. If the corresponding
principal permutation is applied to T, it induces a block lower triangular structure, where the diagonal
blocks are irreducible and correspond to the strong components. This decomposition of T is essentially
unique and is called the (square) Dulmage-Mendelsohn decomposition (or DM decomposition) of T.
The DM decomposition is studied more closely at the end of this section. See, e.g., George and Gustavson
(1980) for details of this decomposition. For example, if

[Display lost in extraction: an example matrix T and its directed graph D, with three labelled components.]

where the decompositions of T and D are indicated by boxes. The next theorem shows that the block
triangular decomposition of $T^*$ has a very special structure.

Theorem 3.3.1: If $l \ne i$, $l \in U_k^*$ and $k \in U_i^*$, then $l \in U_i^*$.

Proof: For ease of notation, let $U = U_i^*$, $\bar{U} = \{1, 2, \ldots, m\} \setminus \{i\} \setminus U_i^*$, $P = P(U_i^*)$ and $\bar{P} = Z_i \setminus P(U_i^*)$.
Thus U and $\bar{U}$ partition the rows of the ith zero-section, and P and $\bar{P}$ partition its columns. By definition
of P(U) and k, the ith zero-section looks like

$$
\begin{array}{c|cc}
 & P & \bar{P} \\ \hline
U & * & 0 \\
\bar{U} & * & *
\end{array}
\qquad (\text{row } k \text{ lies in } U)
$$

Since $U_i^*$ corresponds to a maximum matching, by Proposition 3.1.1 $A_{\bar{U}\bar{P}}$ has a row-perfect matching, and
so at least $|\bar{U}|$ lines are necessary to cover it. Note that since $A_{k\bar{P}} = 0$ and $k \notin \bar{U}$, $A_{\bar{U}\bar{P}}$ is a submatrix of
the kth zero-section.

Let $L_k^*$ be the standard minimum cover of the kth zero-section by lines. Recall that the standard
minimum cover has the largest number of rows among all minimum covers. Consider the set of lines
$L_k = L_k^* \cup \bar{U} \setminus \bar{P}$. The set $L_k$ does not contain the columns of $\bar{P}$, and so it might not cover $A_{\bar{U}\bar{P}}$, which is part
of the kth zero-section. Since the only difference between $L_k$ and $L_k^*$ is in lines passing through $A_{\bar{U}\bar{P}}$, $L_k$
does cover the rest of the kth zero-section. But the only non-zero rows of $A_{\cdot\bar{P}}$ are the rows in $\bar{U}$, and since
$\bar{U} \subseteq L_k$, $L_k$ also covers the kth zero-section.

$L_k^*$ and $L_k$ have the same set of lines outside $A_{\bar{U}\bar{P}}$. Since $A_{\bar{U}\bar{P}}$ has a row-perfect matching, the number
of lines of $L_k^*$ passing through $A_{\bar{U}\bar{P}}$ must be at least $|\bar{U}|$. The lines in $L_k$ passing through $A_{\bar{U}\bar{P}}$ are precisely
$\bar{U}$, the minimum possible number, and so overall $L_k$ contains at most as many lines as $L_k^*$. But since $L_k^*$ is
minimum, $L_k$ must also be a minimum cover.

Finally, note that $L_k$ contains at least as many rows as $L_k^*$ does. But since $L_k^*$ is the (unique) standard
minimum cover, $L_k^*$ has the maximum possible number of rows among all minimum covers. Thus $L_k$ must
equal $L_k^*$, so that $\bar{U} \subseteq \{\text{rows of } L_k^*\}$. Taking complements gives $U_i^* \cup \{i\} \supseteq U_k^* \cup \{k\}$, which gives
$U_k^* \setminus \{i\} \subseteq U_i^*$, as desired. □

In graph terms, Theorem 3.3.1 implies that if l, k and i are distinct nodes of $D^*$, and (l, k) and (k, i) are
edges of $D^*$, then (l, i) must also be an edge of $D^*$. When this is true of a directed graph, it is said to be
transitively closed. Applying the theorem inductively, j is reachable from i in $D^*$ if and only if (i, j) is
an edge of $D^*$. In terms of the block lower triangular structure of $T^*$, this result implies that the diagonal
blocks of $T^*$ are all completely dense, and the subdiagonal blocks are either all zero or completely dense.
In particular, if i and j are in the same strong component C of $D^*$, row i and row j of $T^*$ have the same
sparsity pattern, so that $C \subseteq U_i^* \cup \{i\} = U_j^* \cup \{j\}$. These insights into the structure of $T^*$ are crucial to
the proof of the next theorem.

Theorem  3.3.2:  T*  is  non-singular. 

Proof: It can be assumed without loss of generality that M matches row i to column i in A, i = 1, 2, ..., m,
and that the indices of A and $T^*$ are ordered so that $T^*$ is block lower triangular. Since $T^*$ is block lower
triangular, it suffices to show that each diagonal block of $T^*$ is non-singular.

Recall that the effect of ORA_i is to make $\hat{A}_{iG_i} = 0$. By the choice of M, and $G_i$ relative to M, $G_i = U_i^*$,
so that $\hat{A}_{iU_i^*} = 0$. Since $i \notin U_i^*$, the blocks of $\hat{A}$ corresponding to the diagonal blocks of $T^*$ are diagonal
matrices; the subdiagonal blocks of $\hat{A}$ corresponding to dense subdiagonal blocks of $T^*$ are completely zero.
Thus a typical $\hat{A}$ and $T^*$ might look like

$$
\underbrace{\begin{pmatrix} F & 0 & 0 \\ 0 & F & 0 \\ 0 & F & F \end{pmatrix}}_{T^*}
\underbrace{\begin{pmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix} D & * & * & * \\ * & D & * & * \\ * & 0 & D & * \end{pmatrix}}_{\hat{A}}
$$

where “F” represents a full or dense submatrix, “D” represents a diagonal submatrix, “0” represents a zero
submatrix, and “*” represents an arbitrary submatrix.

Denote the set of indices of a typical diagonal block of $T^*$ by C. For $i \in C$, let $U_C = U_i^* \cup \{i\}$; as
discussed above, this definition is independent of i, and $C \subseteq U_C$. If $C = U_C$, as is the case with the first
two blocks in the example, then $T^*_{CC} A_{CC} = \hat{A}_{CC}$, which is diagonal, and which has non-zero diagonal since
there can be no unexpected cancellation. Since M restricted to $A_{CC}$ is a perfect matching for $A_{CC}$, $A_{CC}$ is
non-singular by (MP). But then $T^*_{CC} = \hat{A}_{CC} A_{CC}^{-1}$, and so $T^*_{CC}$ is non-singular.

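Suppose instead that $C \subset U_C$, as in the third block of the example, and define $L = U_C \setminus C$. Thus

$$A_{U_C U_C} = \begin{pmatrix} A_{LL} & A_{LC} \\ A_{CL} & A_{CC} \end{pmatrix}.$$

The submatrix $T^*_{CC}$ satisfies the following equations by definition of $T^*$, C and L:

$$T^*_{CL} A_{LL} + T^*_{CC} A_{CL} = 0, \qquad (3.3.2)$$

$$T^*_{CL} A_{LC} + T^*_{CC} A_{CC} = \hat{A}_{CC}, \text{ which is diagonal.}$$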

Since M restricted to $A_{U_C U_C}$ is a perfect matching for both $A_{U_C U_C}$ and $A_{LL}$, by (MP) both $A_{U_C U_C}$ and $A_{LL}$
are non-singular. Since $A_{LL}$ is non-singular, $T^*_{CC}$ can be partially solved for in (3.3.2) to get

$$T^*_{CC}\bigl(A_{CC} - A_{CL} A_{LL}^{-1} A_{LC}\bigr) = \hat{A}_{CC}. \qquad (3.3.3)$$

The matrix $A_{CC} - A_{CL} A_{LL}^{-1} A_{LC}$ in (3.3.3) is called the Schur complement of $A_{LL}$ in $A_{U_C U_C}$, denoted
$(A_{U_C U_C} / A_{LL})$. It is well known (see Cottle (1974), equations (2) and (4)) that when $A_{U_C U_C}$ is non-singular,
$(A_{U_C U_C} / A_{LL})$ is non-singular if and only if $A_{LL}$ is non-singular. Since $A_{LL}$ is non-singular here, $T^*_{CC}$ can
be fully solved for in (3.3.3) as the product of two non-singular matrices, implying that $T^*_{CC}$ is non-singular
in this case as well. □
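For readers who want the elimination behind (3.3.3) spelled out, the following lines sketch it. They use only the two block equations for rows C of $T^*$ and the non-singularity of $A_{LL}$; the hat on $\hat{A}_{CC}$ is assumed to denote the corresponding block of the output matrix, as elsewhere in this section.

$$
\begin{aligned}
T^*_{CL} A_{LL} + T^*_{CC} A_{CL} = 0
  \;&\Longrightarrow\; T^*_{CL} = -\,T^*_{CC} A_{CL} A_{LL}^{-1}, \\
T^*_{CL} A_{LC} + T^*_{CC} A_{CC} = \hat{A}_{CC}
  \;&\Longrightarrow\; -\,T^*_{CC} A_{CL} A_{LL}^{-1} A_{LC} + T^*_{CC} A_{CC} = \hat{A}_{CC} \\
  \;&\Longrightarrow\; T^*_{CC}\bigl(A_{CC} - A_{CL} A_{LL}^{-1} A_{LC}\bigr) = \hat{A}_{CC}.
\end{aligned}
$$

The determinant identity $\det A_{U_C U_C} = \det A_{LL} \cdot \det\bigl(A_{CC} - A_{CL} A_{LL}^{-1} A_{LC}\bigr)$, valid whenever $A_{LL}$ is non-singular, is one standard way to see why the Schur complement is non-singular here.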

Since $T^*$ is non-singular, it can be used to transform A into $\hat{A}$. Generating $\hat{A}$ via $T^*$ processes each
row in parallel, i.e. each row is solved relative to the original matrix rather than relative to a partially
transformed matrix. We call this procedure the

Parallel Algorithm (PA):

Find a maximum matching M of A.

For i = 1, 2, ..., m

Generate row i of $\hat{A}$ from A using ORA_i, picking $G_i$ relative to M.

End.

The  results  in  this  section  make  it  easy  to  prove  the  next  theorem. 

Theorem  3.3.3:  PA  solves  SP. 

Proof: $T^*$ is non-singular by Theorem 3.3.2. Since every row of A is made as sparse as possible in $\hat{A}$, $T^*$
must be optimal. □

3.3.2.  The  Sequential  Algorithm 

The parallelism of PA seems unsatisfactory for three reasons. First, it is more natural to process A
sequentially, i.e. by solving each row's matching problem on the partially reduced A whose previous rows
have already been processed. Second, by processing A sequentially, $\hat{A}$ can be overwritten on A, thereby
saving space. Also, it is shown later that the optimal U's can only become smaller, which saves time in
solving equations (3.2.9). Third, and most important, with PA the flexibility in choosing $G_i$ at Step 2 of
ORA_i is lost. Having flexibility in choosing $G_i$ is important when the algorithm is applied to real problems
that might not satisfy (MP). It is important because in practice $A_{U_i G_i}$ might be singular despite having a
perfect matching.

For these reasons, we consider the following algorithm, which overcomes the objections to PA:

Sequential Algorithm (SA):

For i = 1, 2, ..., m

Apply ORA_i to A, choosing $G_i$ in any way desired, as long as $A_{U_i G_i}$ is non-singular.

Replace row i of A by the generated row i of $\hat{A}$.

End.

Note that A is dynamically changed every time through the loop in SA, so that (MP) might no longer hold
for the partially transformed A. However, we know of no example where (MP) fails. By the nature of ORA_i,
each iteration of the loop in SA is equivalent to left-multiplication of A by an elementary transformation.
That is, at iteration i of the loop, the current A is left-multiplied by matrix $E^i$, where $E^i$ is the identity
except that row i of $E^i$ is the X from ORA_i. Since all the $E^i$ are non-singular, the output matrix of SA is
equivalent to the original A.

It  was  easy  to  prove  that  PA  produces  an  optimally  sparse  answer,  but  difficult  to  prove  that  its  output 
matrix  is  equivalent  to  the  input.  For  SA,  the  situation  is  just  the  reverse.  The  next  theorem  uses  the 
optimality  of  PA  to  show  that  SA  is  optimal. 

Theorem  3.3.4:  SA  produces  the  same  final  number  of  non-zeros  as  PA,  and  hence  SA  also  solves  SP. 

Proof: The U computed for row i by ORA_i in PA is still denoted by $U_i^*$, and the (possibly different) U
computed for row i in SA is denoted by $U_i$. Denote by $A^i$ the partially reduced A at the beginning of the
ith iteration of SA (just before replacing row i), so that the original input A equals $A^1$. The proof is by
induction on the row index i; we shall prove row by row that SA produces the same reduction in non-zeros
as PA. The hypothesis $U_k \subseteq U_k^*$ for all $k < i$ is also carried through the induction. At row 1, $U_1 = U_1^*$
since $A = A^1$, and SA and PA both reduce row 1 by the same amount, proving the base of the induction.

For row i, let $R = \{1, 2, \ldots, m\} \setminus \{i\}$ and Z be the row and column indices of the ith zero-section for
$A^i$. Since row i has not yet been changed in $A^i$, R and Z are also the index sets of the ith zero-section for
$A^1$. Recall that $L^* = (R \setminus U_i^*) \cup P(U_i^*)$ is the standard minimum cover of $A^1_{RZ}$ by lines. The first claim is
that $L^*$ is also a covering of $A^i_{RZ}$ by lines. Let $\bar{P} = Z \setminus P(U_i^*)$. The only way for $L^*$ not to be a cover is if
the result of a previous operation has introduced a non-zero into $A^i_{U_i^*\bar{P}}$ (the part of the ith zero-section not
covered by $L^*$).

Suppose that the first non-zero introduced into $A^i_{U_i^*\bar{P}}$ is in row l of $U_i^*$, and that it originated from row
k while processing row $l < i$, so that $k \in U_l$. By the induction hypothesis, $U_l \subseteq U_l^*$, implying that $k \in U_l^*$.
Since $l \in U_i^*$ and $k \in U_l^*$, and since k clearly cannot equal i, by Theorem 3.3.1 $k \in U_i^*$. Since iteration l is
the first one where $A_{U_i^*\bar{P}}$ became non-zero, and since $k \in U_i^*$, $A^l_{k\bar{P}}$ is zero. But then row k cannot introduce
a non-zero into $A_{U_i^*\bar{P}}$. Thus $A^i_{U_i^*\bar{P}}$ must still be zero, and so $L^*$ is a cover for $A^i_{RZ}$.

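$$
\begin{aligned}
\text{Now } |L^*| &= M(A^1_{RZ}) && \text{(by Theorem 3.2.5)} \\
&= \operatorname{rank} A^1_{RZ} && \text{(since (MP) does hold for } A^1\text{)} \\
&= \operatorname{rank} A^i_{RZ} && \text{(since } A^i_{RZ} \text{ is a non-singular transformation of } A^1_{RZ}\text{)} \\
&\le M(A^i_{RZ}) && \text{(by (3.2.3))} \\
&\le |L^*| && \text{(since any cover provides an upper bound for } M(A^i_{RZ})\text{).}
\end{aligned}
$$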

Thus $|L^*| = M(A^i_{RZ})$, and so by Theorem 3.2.5, $L^*$ must be a minimum cover for $A^i_{RZ}$. However, $L^*$
might not be the minimum cover whose complementary U is smallest. Since $U_i^*$ is the complementary U for
$L^*$, and $U_i$ is the complementary U for the standard minimum cover for $A^i_{RZ}$, Theorem 3.2.6 ensures that
$U_i \subseteq U_i^*$. This verifies one of the induction hypotheses.

Recall from (3.2.11) that the reduction in non-zeros from solving ORSP_i is $(m - 1) - |L|$, where L is
a minimum cover for the ith zero-section. The chain of equalities above shows that $M(A^1_{RZ}) = M(A^i_{RZ})$.
By Theorem 3.2.5, the minimum covers for $A^1_{RZ}$ and $A^i_{RZ}$ then have the same size, and the reduction in
non-zeros is the same for both. But the reduction in non-zeros for $A^1_{RZ}$ is achieved by PA, which is optimal
for row i, and so SA must also be optimal for row i. □

The proof of Theorem 3.3.4 produces the bonus that the sequential U's are (if anything) smaller than
the parallel U's, so that SA needs to solve smaller linear systems to obtain X.

3.3.3. The SP Decomposition

The proof of Theorem 3.2.2 shows that the $U_i^*$ are strongly interrelated, and that their joint structure
is related to the block triangular decomposition. Block triangular decomposition was developed by
Dulmage and Mendelsohn (1963) for rectangular matrices, but our interest is in DM decomposition of square
matrices, as discussed in the remarks before Theorem 3.2.2. The theory of decompositions has recently been
considerably generalised by Iri (1983).

Let M be a row-perfect matching in A, and let m(i) be the column to which row i matches under M.
Define the directed graph $D_M$ to have nodes {1, 2, ..., m} and arcs $\{(i, j) \mid a_{j,m(i)} \ne 0\}$. We say that row
j is reachable from row i via M, denoted $i \to_M j$, if j is reachable from i in $D_M$. For example, if

$$A = \begin{pmatrix} \otimes & 0 & 0 \\ \times & \otimes & \times \\ 0 & \times & \otimes \end{pmatrix},$$

and M is the circled matching, then $1 \to_M 3$ but $3 \not\to_M 1$. Note that this concept essentially depends only on the
square submatrix of A induced by the matched columns of M. The next theorem illuminates the connection
between the $U_i^*$ and DM decomposition.
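The reachability claim in the 3 × 3 example can be checked mechanically. The sketch below is an illustration only: the arc rule used, an arc from row i to row j whenever row j has a non-zero in the column matched to row i, is an assumption reconstructed from a partly illegible definition (it does agree with the stated outcome of the example), and the names `pattern`, `match` and `reachable_via` are hypothetical. Indices are 0-based, so rows 1 and 3 of the text are rows 0 and 2 here.

```python
# Sparsity pattern of the example (1 = non-zero); the matching is the diagonal.
pattern = [
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
]
match = {0: 0, 1: 1, 2: 2}  # row -> matched column


def reachable_via(pattern, match, start):
    """Rows reachable from `start` in the graph induced by the matching."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        col = match[i]
        # Assumed arc rule: i -> j whenever row j is non-zero in column m(i).
        for j in range(len(pattern)):
            if pattern[j][col] and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


from_row_1 = reachable_via(pattern, match, 0)
from_row_3 = reachable_via(pattern, match, 2)
```

Starting from the first row every row is reached, while starting from the third row only the second and third rows are reached, which matches the statement that row 3 is reachable from row 1 but not conversely.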

Theorem 3.3.5: $i \in U_j^*$ if and only if $i \to_M j$ for all row-perfect matchings M of A.

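Proof: Two facts are needed for this proof. Fact 1 is that any (partial) matching can be extended to a
maximum matching that uses the same columns as the original matching (see Lawler (1976), Theorem 5.4.1).
Fact 2 is that $i \not\to_M j$ if and only if the square submatrix of A induced by the columns of M can be partitioned like

[Display (3.3.4), partly lost in extraction: the rows are split into R (containing row i) and the remaining rows (containing row j), the columns into N and the remaining matched columns; circled entries mark the matching, and the block formed by the rows outside R and the columns of N is zero.]  (3.3.4)

where R is the subset of rows k for which $i \to_M k$, M also denotes the subset of columns matched under M, and
$N = \{\, j : m(k) = j \text{ for some } k \in R \,\}$.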

(Proof of ⇒) Assume that $i \in U_j^*$, but that there is a row-perfect matching M such that $i \not\to_M j$, and
so by Fact 2, $A_{\cdot M}$ looks like (3.3.4). Restrict M to $A_{\cdot N}$, and note that $A_{\cdot N}$ is part of the jth
zero-section. Using Fact 1, extend this restricted matching to a maximum matching in the jth zero-section, which then looks like

[Display lost in extraction: the jth zero-section with this maximum matching, showing zero blocks above the matched part of $A_{\cdot N}$ and a group of unmatched rows.]

Because of the zeros above the matched part of $A_{\cdot N}$, there is no way to label a row of R starting at an
unmatched row. But then $i \notin U_j^*$, a contradiction.

(Proof of ⇐) Assume that $i \to_M j$ for all row-perfect M, but that $i \notin U_j^*$. Since $i \notin U_j^*$, i is not in the
standard minimum cover of the jth zero-section. Hence, the jth zero-section with a maximum matching
in it must look like

[Display lost in extraction: the jth zero-section with its maximum matching, with one row marked.]

By Fact 1, this matching can be extended to a row-perfect matching M' in all of A that uses the same columns, so that A
looks like

[Display lost in extraction: the resulting partition of A.]

But now $i \not\to_{M'} j$ by Fact 2, a contradiction. □

The remarks after Theorem 3.3.1 show that the $U_i^*$ induce a decomposition, which we call the SP
decomposition, on the rows of A. That is,

$$i \sim j \iff i \in U_j^* \text{ and } j \in U_i^*$$

is an equivalence relation which induces an ordered partition on the rows of A. Stated another way, $i \sim j$
if and only if i and j belong to the same diagonal block of $T^*$. Because of the uniqueness of the $U_i^*$, the
SP decomposition is an intrinsic or canonical property of a sparse matrix. Theorem 3.3.5 was motivated by
curiosity about the relationship of the SP decomposition to the DM decomposition, which is also canonical.
The content of Theorem 3.3.5 is that the SP decomposition is the coarsest partition of the rows that is a
refinement of every square DM decomposition.

3.3.4.  The  Complexity  of  the  Null  Space  Problem 

Another interesting theoretical consequence of PA and SA applies to computational complexity. Our
research into SP was originally motivated by a different problem, the

Null Space Problem (NSP): Let m < n. Given a sparse m × n full-rank matrix A, find an (n - m) × n
null space matrix Z of full rank so that (1) $Z A^T = 0$ and (2) Z is as sparse as possible.

NSP is also an important problem to solve for large-scale constrained optimisation, since so-called null
space methods need such a Z to operate efficiently (see Gill, Murray and Wright (1981), Section 5.1.3; see
also Kaneko, Lawo and Thierauf (1982) for a heuristic approach to NSP for a subclass of matrices).

Suppose that there were a polynomial algorithm for solving NSP (say, Algorithm Z). Apply Algorithm
Z to A to obtain an optimal $Z^*$. Now apply Algorithm Z to $Z^*$ to obtain $Z^{**}$. Since A spans the null space
of $Z^*$, by simple linear algebra $Z^{**}$ must be equivalent to A. If there were any other matrix equivalent to
A and sparser than $Z^{**}$, then $Z^{**}$ would not be an optimal null space basis for $Z^*$. Thus $Z^{**}$ must be the
sparsest possible equivalent matrix to A, so that $Z^{**}$ solves SP for A. In a vague sense then, SP = (NSP)$^2$.

It is fascinating that solving NSP is NP-Complete, even with a stronger generality assumption about A
than (MP). Some preliminary discussion is needed to show this result.

Suppose that there is a vector $z^*$ in the null space of A that has fewer non-zero entries than any of
the rows of $Z^*$. If $z^*$ is adjoined to $Z^*$, a dependency that includes $z^*$ is created among the rows of the
augmented matrix. If a row of $Z^*$ which is dependent on $z^*$ in the augmented $Z^*$ is deleted, the resulting
matrix still spans the null space of A, and is even sparser than $Z^*$. Therefore an optimal $Z^*$ must include
a sparsest possible null-space vector.

Now consider the subset C of columns of A picked out by the non-zero entries of such a $z^*$. Since
$A_{\cdot C}\, z^*_C = 0$, the columns in C must be dependent. By (MP) or any stronger assumption, since $A_{\cdot C}$ does
not have full column rank, it cannot have a column-perfect matching. By Proposition 3.1.1 $A_{\cdot C}$ looks like

[Display lost in extraction: $A_{\cdot C}$ with a zero block, circled matching entries, and the column subset D marked.]

with column subset D defined as indicated. The columns of D must also be dependent, but if |D| < |C|,
there is a sparser null-space vector than $z^*$ based on D, contradicting that $z^*$ is sparsest possible. Thus $A_{\cdot C}$
must look like

[Display lost in extraction: $A_{\cdot C}$ with its non-zero rows R and all remaining rows zero.]

where R is the set of non-zero rows in $A_{\cdot C}$. If |R| + 1 < |C|, a column of C could be dropped and the
columns in C would still be dependent, so that |R| + 1 = |C|. For $j \in C$, let $C_j = C \setminus \{j\}$. Then $M(A_{RC_j})$
must equal |R| for all $j \in C$, for otherwise the size of C could be diminished, again contradicting optimality
of $z^*$. Conversely, given such a C, a null-space vector z whose support is C can clearly be constructed.

A  column  subset  C  such  that  A.c;  has  |C|  —  I  non-zero  rows  R  and  such  that  M(Anc,)  —  |f?|  for  all 
j  £  C  is  called  a  circuit  of  A.  (It  is  called  this  because  it  is  a  circuit  of  the  malroid  generated  by  the 
columns  of  A.  A  circuit  of  a  matroid  is  a  minimal  dependent  set;  sec  Welsh  (1976).)  Thus  if  NSP  has  a 
polynomial  algorithm,  there  is  also  a  polynomial  algorithm  for  the  following  problem: 

The Minimum Circuit Problem (MCP): Given an $m \times n$ sparse matrix A, find a minimum
cardinality circuit of A.

The  size  of  a  minimum  circuit  of  A  is  called  the  girth  of  A. 

Theorem 3.3.6: MCP is NP-Complete, and thus NSP is NP-Hard.

Proof: This proof is due to Stockmeyer (1982). The problem that we shall reduce to MCP is

The m-Clique Problem: Given an undirected graph G, determine whether G has a clique of size m
(a clique is a node-induced complete graph; see Bondy and Murty (1976), Section 7.2).

A proof that the m-Clique Problem is NP-Complete is referenced in Garey and Johnson (1979), Problem
[GT19].

Given a graph G with v vertices and e edges, construct a $(v + \binom{m}{2} - m - 1) \times e$ sparsity pattern A(G)
as follows. Index the first v rows of A(G) with the vertices of G, and its columns with the edges of G. Make
$a_{i,\{j,k\}} \neq 0$ when i = j or i = k, so that the first v rows of A(G) have the same sparsity pattern as the
vertex-edge incidence matrix of G. For each row i > v, set $a_{ie} \neq 0$ for all edges e.

Suppose that A(G) has a circuit C with $|C| = c < \binom{m}{2}$. Let $d = \binom{m}{2} - c$, so that d > 0. Since C is a
circuit, $A_{\cdot C}$ has c - 1 non-zero rows. Because the $\binom{m}{2} - m - 1$ non-vertex rows are among these, $A_{\cdot C}$ has
$(c - 1) - (\binom{m}{2} - m - 1) = m - d$ non-zero vertex rows. Denote the set of such rows by R, and denote by
f the number of edges in the subgraph of G induced by the vertices in R. Since $|R| = m - d$ vertices can
induce at most $\binom{m-d}{2}$ edges, $f \le \binom{m-d}{2}$. But surely all the edges in C are among those induced by R, so
that $f \ge c = \binom{m}{2} - d$. Putting these inequalities together yields $\binom{m}{2} - d \le f \le \binom{m-d}{2}$. However, it is easy
to show that when m > 3 (which can be assumed without loss of generality), $\binom{m}{2} - d > \binom{m-d}{2}$.

Thus every circuit C of A(G) must satisfy $|C| \ge \binom{m}{2}$. Suppose that G has an m-clique, say on the vertices
in the set R (so that |R| = m), and with the $\binom{m}{2}$ edges in the set C. Since |R| + (number of non-vertex rows) =
$\binom{m}{2} - 1 = |C| - 1$, it is easy to verify that C is a circuit.

Conversely, suppose that A(G) has a circuit C of size $\binom{m}{2}$, and let R be the m non-zero vertex rows in
$A_{\cdot C}$. Let f again be the number of edges in the subgraph of G induced by R. As above, it must be true
that $f \ge |C| = \binom{m}{2}$. But m vertices can induce at most $\binom{m}{2}$ edges, implying that $f \le \binom{m}{2}$. Hence $f = \binom{m}{2}$,
and the vertices in R are an m-clique.

Thus G has an m-clique if and only if the girth of A(G) is $\binom{m}{2}$. If there were a polynomial algorithm for
MCP it could be used to determine the girth of A(G), and thereby determine whether G has an m-clique.
But the m-Clique Problem is NP-Complete, and so MCP must also be NP-Complete. Therefore solving NSP
is NP-Hard, since (as shown above) solving NSP also solves MCP. □
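The construction in this proof can be tried out on very small graphs. The following is a minimal brute-force sketch, not part of the original text: the function names (`build_pattern`, `max_matching`, `is_circuit`, `girth`) are hypothetical, vertices are assumed to be numbered from 0, and the circuit test simply applies the definition above (|C| - 1 non-zero rows R, and a matching of size |R| after deleting any one column). Its running time is exponential in the number of columns, which is consistent with the hardness result.

```python
from itertools import combinations
from math import comb


def build_pattern(num_vertices, edges, m):
    """Sparsity pattern A(G): one row per vertex plus C(m,2) - m - 1 dense rows.
    Each row is stored as the set of column (edge) indices that are non-zero."""
    rows = [set() for _ in range(num_vertices)]
    for col, (j, k) in enumerate(edges):
        rows[j].add(col)
        rows[k].add(col)
    dense = comb(m, 2) - m - 1          # non-vertex rows (needs m > 3)
    rows.extend(set(range(len(edges))) for _ in range(dense))
    return rows


def max_matching(rows, cols):
    """Size of a maximum matching of the given rows into the column set cols."""
    match_col = {}                      # column -> index of the row matched to it

    def augment(r, seen):
        for c in rows[r] & cols:
            if c in seen:
                continue
            seen.add(c)
            if c not in match_col or augment(match_col[c], seen):
                match_col[c] = r
                return True
        return False

    return sum(augment(r, set()) for r in range(len(rows)))


def is_circuit(pattern, columns):
    """Apply the definition of a circuit to the column subset `columns`."""
    C = set(columns)
    R = [row for row in pattern if row & C]     # non-zero rows of A restricted to C
    if len(R) != len(C) - 1:
        return False
    return all(max_matching(R, C - {j}) == len(R) for j in C)


def girth(pattern, num_cols):
    """Size of a smallest circuit, or None if there is no circuit at all."""
    for size in range(1, num_cols + 1):
        for C in combinations(range(num_cols), size):
            if is_circuit(pattern, C):
                return size
    return None


# K4 contains a 4-clique, so the girth of A(K4) should be C(4,2) = 6.
k4_edges = list(combinations(range(4), 2))
girth_k4 = girth(build_pattern(4, k4_edges, 4), len(k4_edges))

# K4 with one edge removed has no 4-clique.
k4_minus_edges = k4_edges[1:]
girth_k4_minus = girth(build_pattern(4, k4_minus_edges, 4), len(k4_minus_edges))
```

Under these assumptions, G has an m-clique exactly when `girth(...) == comb(m, 2)`.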

This theorem establishes the somewhat surprising result that, under the assumption (MP), NSP is
NP-Hard even though $\mathrm{SP} = (\mathrm{NSP})^2$ and SP has a polynomial algorithm. Hence complexity is not preserved
under taking “square roots.”

This analysis also has a connection with a different area of complexity research. Hausman and Korte
(1981) have investigated the relative power of various matroid oracles in order to improve understanding of
the complexity of matroid algorithms. They have shown that a girth oracle is strictly stronger than any
other matroid oracle studied. It was previously known that there is a polynomial girth oracle for graphic
matroids (see Itai and Rodeh (1978)), but no polynomial girth oracles have been discovered for any more
complicated classes of matroids. The matroids generated by the columns of sparse matrices satisfying (MP)
are transversal matroids (see Welsh (1976), Section 7.3). Transversal matroids are one of the simplest classes
of matroids besides the class of graphic matroids, yet Theorem 3.3.6 shows that it is extremely unlikely that
there is a polynomial girth oracle for transversal matroids. Perhaps this is why girth oracles are so powerful.


3.4.  Practical  Algorithms  for  SP 

We  now  consider  using  the  algorithms  of  Section  3.3  to  process  real  matrices.  The  behavior  of  the 
algorithms when A does not have full rank and when it does not satisfy (MP) is investigated. Various
modifications  to  the  algorithms  that  decrease  their  running  time  in  practice  are  also  discussed. 

3.4.1.  Processing  Rank-deficient  Matrices 

The first step in developing a more practical algorithm than PA or SA is to drop the assumption that
rank A = m. The object of solving SP is to find a sparser matrix that spans the same row space as A. Therefore,
when A is rank-deficient an algorithm should select a row basis for A, delete the remaining dependent rows,
and use PA or SA to make the row basis optimally sparse. The next theorem shows that while (MP) still
holds, the answer obtained is independent of the choice of basis.

Theorem 3.4.1: Both PA and SA produce the same final number of non-zeros when applied to any maximum
independent subset of rows of A.


Proof: By Theorem 3.3.4, it suffices to prove the theorem for PA. Any two bases are connected by a sequence
of row swaps, and hence it is sufficient to consider two bases that differ only by a swap.

Let $B_1$ be a subset of rows which is a basis for A, and let r and a be rows with $r \in B_1$, $a \notin B_1$ such
that $B_2 = B_1 \cup \{a\} \setminus \{r\}$ is also a basis for A. We shall show that PA produces the same final number of
non-zeros when applied to $A_{B_1\cdot}$ and $A_{B_2\cdot}$. Since (MP) is still assumed, rank A = M(A) = $|B_1| = |B_2|$.

Let $i \in B_1 \setminus \{r\}$, and let $l_k$ be the size of a minimum cover of the ith zero-section in $A_{B_k\cdot}$, k = 1, 2.
By (3.2.1) it suffices to show that $l_1 = l_2$. Define $U$, $\bar U$ and $P$, $\bar P$ so that they partition the rows and columns
of the ith zero-section of $A_{B_1\cdot}$ as before, and hence $U = U_i$ (of $A_{B_1\cdot}$), etc. Recall that $L_1 = \bar U \cup P$ is the
standard minimum cover of the ith zero-section of $A_{B_1\cdot}$ by lines, so that $l_1 = |L_1|$. The ith zero-section of
$A_{B_1\cdot}$ with a maximum matching M and row a adjoined must look like


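[Diagram: the ith zero-section of $A_{B_1\cdot}$ with row i (all zeros) and the adjoined row a, showing the column sets P, N and K.]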


By Fact 1 of Theorem 3.3.5, M can be extended to a perfect matching of all of $A_{B_1\cdot}$ that uses the
same columns as M. Hence it can be assumed without loss of generality that M is part of a row-perfect
matching of $A_{B_1\cdot}$. Thus row a must be zero in the columns in N (the unmatched columns), for otherwise M
could be trivially extended to row a, contradicting the fact that $B_1$ is a basis.

Define K to be the columns in $\bar P$ where row a is non-zero, as shown above. Let $M'$ be M restricted to
$A_{\bar U \bar P}$. Try to extend $M'$ by labelling in the submatrix $A_{\bar U \cup \{a\}, \bar P}$. Since $A_{U \bar P}$ is zero, $M'$ cannot be so
extended, for otherwise M could be extended. Define the submatrix of labelled rows and columns of $A_{\bar U \bar P}$ to
be B; clearly $K \subseteq \{\text{columns of } B\}$, and by the properties of the labelling process, B must be square. Note
also that the rows of B must be zero in the columns of $\bar P \setminus \{\text{columns of } B\}$, for otherwise more rows and
columns would have been labelled. Thus the picture has now become



[Diagram: the zero-section above with the square submatrix B marked.]


Define $L_2 = L_1 \cup \{\text{columns of } B\} \setminus \{\text{rows of } B\}$. By the above remarks, $L_2$ must cover the ith
zero-section of $A_{B_1 \cup \{a\}\cdot}$, and so it also covers that of $A_{B_2\cdot}$. Note that $|L_2| = |L_1| = l_1$. Since $L_2$ is a
cover of the ith zero-section of $A_{B_2\cdot}$, $l_2$ can be at most $l_1$. Now repeat the argument, reversing the roles of
$B_1$ and $B_2$, and of r and a, obtaining $l_1 \le l_2$. Thus $l_1 = l_2$, and consequently PA produces the same result on
row i with either $B_1$ or $B_2$.

It remains only to show that PA produces the same final number of non-zeros for r in $A_{B_1\cdot}$ as it does
for a in $A_{B_2\cdot}$. Note that row r might have a different starting number of non-zeros than row a. It is only
claimed that the final answers have the same number of non-zeros; if, say, row r has more initial non-zeros
than row a, PA must eliminate more non-zeros from $A_{B_1\cdot}$ than from $A_{B_2\cdot}$ to obtain the same final number
of non-zeros.

Let M be a perfect matching in $A_{B_1\cdot}$, and adjoin a to $A_{B_1\cdot}$. Now try to extend M in $A_{B_1 \cup \{a\}\cdot}$ by
labelling as above, starting at a, again obtaining a square submatrix B of $A_{B_1\cdot}$ such that the rows of B are
zero outside the columns of B. Thus the picture must look like


[Diagram: the square submatrix B within $A_{B_1\cdot}$, with rows r and a marked.]

Since a can be exchanged for r while maintaining a matching of size $|B_1|$, r must be among the rows of B.
But B has a perfect matching, so that PA can clearly use all the rows $k \neq r$ in B to eliminate all but one
non-zero from row r, and this use of rows is optimal. By reversing roles again, the transformed row a in
$A_{B_2\cdot}$ must also contain only one non-zero. Thus PA produces the same result using either $B_1$ or $B_2$. □

This proof shows, however, that $|U_i|$ can be different for $B_1$ and $B_2$, and so one basis might be best in
terms of requiring the least amount of work. Unfortunately, we know of no way to determine such a basis.

3.4.2.  Processing  Matrices  without  (MP) 

The next step towards practicality is the major one of dropping (MP). Though the use of (MP) as
a  tool  to  derive  an  algorithm  is  quite  justified,  it  does  not  follow  that  (MP)  is  actually  satisfied  by  most 
real  matrices.  Indeed,  the  results  in  Section  3.5  show  that  none  of  the  real  matrices  tested  satisfies  (MP). 
Now  (MP)  is  formally  renounced  as  an  assumption.  To  simplify  matters  at  first,  assume  once  again  that 
rank  A  =  m. 

Without (MP), as was discussed in Section 3.3, PA is unusable in its present form. To reiterate, it
is unusable because PA assumes that any square submatrix selected by a fixed row-perfect matching is
non-singular, which is no longer necessarily true. In contrast, SA has great freedom in choosing the $G_i$. The
advantage of SA over PA is that SA can choose $G_i$ by a numerical criterion rather than a combinatorial one.

To see how SA chooses $G_i$ in practice, first note that since A (or a basis of A) starts out with full row
rank, and since A is multiplied by a sequence of non-singular transformations as SA progresses, at any point
in the execution of SA, and for any row subset U, rank $A_{U\cdot} = |U|$. Thus $A_{U\cdot}$ must always have some column
subset G such that $A_{UG}$ is non-singular. At the appropriate point of SA, $A_{U\cdot}$ is passed to a subroutine
that can numerically select a G so that $A_{UG}$ is non-singular. A subroutine that can perform sparse LU
factorisations of rectangular matrices is ideal. In the experimental implementation described in Section 3.5,
a state-of-the-art Harwell black-box subroutine called MA28 is used for this purpose (see Duff (1977)).

A further benefit of choosing G numerically by an LU-factorisation is that it neatly combines Steps 2
and 3 of $\mathrm{ORA}_i$. That is, since the factorisation routine selects G in the course of computing the
LU-factorisation of $A_{UG}$, solving for $\lambda$ at Step 3 of $\mathrm{ORA}_i$ becomes quite easy. Also, a good factorisation routine
like MA28 chooses G so that $A_{UG}$ is fairly well-conditioned. This property gives some assurance that the
reduced A is not much worse conditioned than the original A.

The  above  method  of  implementing  SA  enables  it  to  reliably  process  real  matrices.  When  SA  is  applied 
to  a  real  problem  A,  we  would  ideally  like  to  guarantee  that  it  reduces  the  number  of  non-zeros  of  A  at 
least  as  much  as  if  A  did  satisfy  (MP).  That  is,  for  a  given  sparsity  pattern  A,  there  is  a  well-defined 
reduction  in  non-zeros  r  possible  by  either  PA  or  SA,  independent  of  the  values  of  non-zero  entries  of  A, 
as  long  as  they  satisfy  (MP)  (indeed,  it  is  possible  to  run  either  PA  or  SA  on  a  sparsity  pattern  totally 
combinatorially,  without  doing  any  numerical  operations  whatsoever).  Not  satisfying  (MP)  means  that 
unexpected  cancellation  can  happen,  and  it  seems  that  such  cancellation  could  only  help.  Hence  it  should 
be  possible  to  show  that  at  least  r  non-zeros  are  eliminated. 

However,  proving  such  a  guarantee  is  somewhat  subtle.  Consider  the  full-rank  matrix 

$$A = \begin{pmatrix} 1 & 3 & 0 & 5 & 5 \\ 2 & 1 & 4 & 0 & 0 \\ 0 & 3 & 0 & 5 & 5 \end{pmatrix}.$$

The Sequential Algorithm chooses $U_1 = \{3\}$, and could choose $G_1 = \{2\}$. The associated transformation
unexpectedly zeros out columns 4 and 5 of row 1. If row 2 is processed using the new row 1, SA chooses
$U_2 = \{1\}$. But the corresponding set for the Parallel Algorithm is empty, and so does not contain $U_2$ as
required by the induction hypothesis of Theorem 3.3.4. Since the performance of SA depends on the hypothesis,
it cannot be guaranteed that SA eliminates as many non-zeros as the ideal. (A close reading of the proof of
Theorem 3.3.4 reveals that this difficulty can arise only when rank $A_{RZ} < M(A_{RZ})$ for some zero-section; in
the second zero-section of this example, rank $A_{RZ} = 1 < 2 = M(A_{RZ})$.)
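The arithmetic of this example can be checked directly. The sketch below is illustrative only: the helper `rank` and the variable names are hypothetical, indices are 0-based in the code (so "row 1" of the text is index 0), and the multiplier is obtained from the single equation that makes the transformed row 1 zero in column 2.

```python
from fractions import Fraction

A = [[1, 3, 0, 5, 5],
     [2, 1, 4, 0, 0],
     [0, 3, 0, 5, 5]]


def rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Reduce row 1 with U_1 = {3} and G_1 = {2}: choose lam so that
# a_12 + lam * a_32 = 0, then form row 1 + lam * row 3.
lam = Fraction(-A[0][1], A[2][1])
new_row1 = [a + lam * b for a, b in zip(A[0], A[2])]

# Second zero-section: rows 1 and 3 restricted to the columns where row 2 is zero.
zero_cols = [j for j, x in enumerate(A[1]) if x == 0]
section = [[A[i][j] for j in zero_cols] for i in (0, 2)]
section_rank = rank(section)
```

Here `new_row1` has lost its entries in columns 4 and 5 as well as in column 2, and the zero-section is a fully dense 2 x 2 block (so its maximum matching has size 2) whose numerical rank is only 1.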

A simple trick avoids this difficulty. As SA executes, at each step it can recognize where non-zeros are
expected to occur for subsequent steps. Unexpected cancellation can be recognized in two ways. First, if the
current row is i and $a_{ij} \neq 0$, $j \notin G_i$, but the transformed entry $\bar a_{ij} = 0$, SA has a lucky hit. Second, if
$a_{ij} = 0$ and $j \in P(U_i) \setminus G_i$ (the expected fill-in columns), but $\bar a_{ij} = 0$, then SA has a lucky non-fill-in.
When unexpected cancellation occurs, SA can put a phantom non-zero in that entry of the matrix (a zero that is
treated as if it were a non-zero). That is, subsequent matchings are performed as if no unexpected cancellation
ever took place, although SA keeps track of which “non-zeros” are really zeros. As long as A initially has full
row rank, the numerical operations can never create a dependence among the rows. Thus SA can always find a G
so that $A_{UG}$ is non-singular, even with phantom non-zeros. When SA is modified by using phantom non-zeros, it is
called the Safeguarded Sequential Algorithm (SSA).
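The bookkeeping for phantom non-zeros can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function name and its arguments are hypothetical, columns are 0-based, and entries are compared with exact zero (a floating-point implementation would presumably need a drop tolerance).

```python
def safeguarded_pattern(old_row, new_row, pivot_cols, fill_cols):
    """Return (pattern, phantoms) for a row that has just been transformed.

    old_row, new_row : values of the row before and after the transformation
    pivot_cols       : the columns G_i whose entries are eliminated on purpose
    fill_cols        : the columns P(U_i) minus G_i where non-zeros are expected
    """
    old_nonzeros = {j for j, x in enumerate(old_row) if x != 0}
    # Pattern that later matchings should see, as if no cancellation occurred.
    expected = (old_nonzeros - set(pivot_cols)) | set(fill_cols)
    # Expected non-zeros that are numerically zero: lucky hits or non-fill-ins.
    phantoms = {j for j in expected if new_row[j] == 0}
    return expected, phantoms


# Row 1 of the example above, reduced by row 3 with pivot column 2 (index 1).
pattern, phantoms = safeguarded_pattern(
    old_row=[1, 3, 0, 5, 5],
    new_row=[1, 0, 0, 0, 0],
    pivot_cols={1},
    fill_cols={3, 4},     # the other non-zero columns of row 3
)
```

In this example the pattern handed to later matchings still contains columns 4 and 5 (indices 3 and 4), both recorded as phantoms.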

Theorem  3.4.2:  SSA  eliminates  at  least  as  many  non-zeros  from  any  full-rank  matrix  A  as  it  would  if  A 
satisfied  (MP). 

Proof: The proof of Theorem 3.3.4 becomes valid once again with SSA, which yields the guarantee. □

Of course, it is possible to apply SA on real problems without the safeguard of keeping phantom
non-zeros, but then the guarantee of Theorem 3.4.2 is lost. Letting SA “know” about more zeros by removing the
safeguard might give it freedom to produce greater reductions, but the experiments in Section 3.5 show no
clear advantage for either SA or SSA.

Now we drop the assumption that rank A = m once again. The following example shows that Theorem
3.4.1 is no longer true:

$$\begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ -1 & 0 & 0 & -1 & 1 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & -1 & 0 & 0 & -1 \end{pmatrix}$$

Though M(A) = 4, rank A = 3, since the sum of the rows is zero. If the first row is dropped to get a full-rank
submatrix, then neither SA nor SSA eliminates any non-zeros from A, for a final total of 7 non-zeros. If the
last row is dropped, then once again neither SA nor SSA eliminates any non-zeros from A, for a final total of
8 non-zeros. We know of no way of choosing a basis of A that minimizes the final number of non-zeros.
We conjecture that the difference between bases is negligibly small in practice, so that this issue would not
be a major concern. The truth of this conjecture has not yet been empirically tested.
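The counts quoted for this example are easy to confirm. The short check below is a sketch with hypothetical variable names; it verifies only the row dependence and the non-zero totals, and takes from the text the statement that neither SA nor SSA eliminates anything.

```python
A = [[ 1,  1,  1,  0,  0],
     [-1,  0,  0, -1,  1],
     [ 0,  0, -1,  1,  0],
     [ 0, -1,  0,  0, -1]]

# The rows sum to the zero vector, so they are dependent.
column_sums = [sum(col) for col in zip(*A)]

nonzeros_per_row = [sum(1 for x in row if x != 0) for row in A]
total = sum(nonzeros_per_row)

# Non-zeros remaining when the first or the last row is dropped from the basis.
without_first = total - nonzeros_per_row[0]
without_last = total - nonzeros_per_row[-1]
```

The two candidate bases differ by one non-zero, which is the sense in which the result now depends on the choice of basis.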


3.4.3.  Miscellaneous  Modifications  to  SA 

The first step in any implementation of SA or SSA must be the determination of a row basis for A.
As discussed above, a routine such as MA28 is ideal for this task. If a row basis is determined through an
LU-factorization as in MA28, an important side benefit can be realized.

Note that an LU-factorization picks out a square submatrix $A_{RC}$ of A such that |R| = |C| =
rank $A_{RC}$ = rank A. As SA or SSA processes $A_{R\cdot}$, the partially transformed $A_{RC}$ remains non-singular
just as $A_{R\cdot}$ retains full row rank. Thus, when the algorithm needs to search $A_{U\cdot}$ for a non-singular
submatrix $A_{UG}$, it can restrict its search to $A_{UC}$. Since |C| is usually much smaller than n, such a restriction
can lead to large savings in time if $A_{UC}$ instead of $A_{U\cdot}$ is factored to find G. Of course, restricting the
columns in which $G_i$ can occur also restricts the freedom of the factorization routine to choose a
well-conditioned $G_i$, but since $A_{RC}$ is chosen to be reasonably well-conditioned, the restricted $G_i$ should not be
much worse conditioned than the unrestricted $G_i$. This modification is called the restricted column option.
The results in Section 3.5 report on its performance.

There  is  another  modification  that  can  speed  up  the  combinatorial  parts  of  the  algorithms.  All  of  the 
combinatorial  effort  of  the  algorithms  involves  computing  maximum  cardinality  matchings  in  zero-sections. 
Instead  of  starting  each  such  matching  from  the  empty  matching,  it  is  faster  to  initialize  them  as  follows. 
Start by finding a one-time fixed maximum matching for A, call it M. Then initialize the matching in the
ith zero-section with M restricted to the columns of the zero-section. An entry in M might be eliminated
at  some  point  during  execution  of  the  algorithm.  If  this  should  happen,  a  single  augmentation  of  M  before 
the  next  iteration  restores  M  to  a  maximum  matching.  This  method  of  initialization  is  called  warm-start 
matching. 

With warm-start matching a good bound on the combinatorial running time of SA and SSA can be
derived. Let $\nu$ be the number of non-zeros of A, which can be assumed to be greater than n. Finding the
original M takes $O(\sqrt{m+n}\,\nu)$ time (see Papadimitriou and Steiglitz (1982), Theorem 10.3). Copying the
columns of M into a starting solution for the ith zero-section takes $O(m)$ time, and hence copying takes
$O(m^2)$ overall. After copying M into the ith zero-section, each remaining unmatched row in the zero-section
matches to a column outside the zero-section under M. The number of initially unmatched rows in the ith
zero-section is at most the number of non-zeros in row i of A, and so the total number of unmatched rows
is at most $\nu$. Each unmatched row can lead to at most one augmentation of a matching, so that the total
number of zero-section augmentations is at most $\nu$. An augmentation is an $O(\nu)$ process, making the total
time spent in zero-section matching $O(\nu^2)$. Finally, each entry of M can be eliminated at most once (when
its row is processed); therefore, a single augmentation of M might be needed at most m times. Again each
augmentation takes at most $O(\nu)$, for total time $O(m\nu)$ spent repairing M. Therefore the total time bound
for the combinatorial running time of SA and SSA with warm-start matching is $O(\nu^2)$.
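The four contributions can be collected in one line. This summary is only a sketch of the arithmetic; besides the stated assumption $n < \nu$ it assumes $m \le n$, which holds when A has full row rank.

```latex
\begin{align*}
T_{\text{comb}}
  &= \underbrace{O(\sqrt{m+n}\,\nu)}_{\text{initial } M}
   + \underbrace{O(m^2)}_{\text{copying } M}
   + \underbrace{O(\nu^2)}_{\text{zero-section augmentations}}
   + \underbrace{O(m\nu)}_{\text{repairing } M} \\
  &= O(\nu^2), \qquad \text{since } m \le n < \nu .
\end{align*}
```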

It is more difficult to obtain an accurate bound on the numerical running time of SA and SSA. In the
worst case, even with the restricted column option, the algorithms have to factor and solve a linear system of
size $O(m)$ for each of m rows (plus once at the beginning to obtain a basis for A). Since factoring and solving
one such system is bounded by $O(m^3)$, an overall numerical bound is $O(m^4)$. However, since A is sparse, the
systems to be solved are also sparse. An efficiently implemented sparse equation solver like MA28 tends to
solve such systems in time $O(\nu)$ (see Duff (1977), Table 3), which would give the much better overall bound
of $O(m\nu)$ for numerical operations.

As  a  final  observation  on  applying  the  algorithms  to  practical  problems,  note  that  linear  constraints  are 
usually  not  presented  as  a  system  of  equations  as  in  (3.1.3),  but  as  a  mixture  of  inequalities  and  equations. 
Inequalities can be converted to equalities by adding a slack variable, so that the constraints are then in a
form  which  is  suitable  for  the  algorithms.  However,  there  is  another  modification  that  can  be  used  to  save 
time  when  inequalities  are  present. 

First, note that an inequality row i can never be used by another row, because its slack column (a unit
vector) is in the zero-section of every other row. It can be assumed without loss of generality that the
maximum matching in each such zero-section includes the slack entry. Since all other entries in the slack
column are zero, there is no way for row i to be labelled during the matching in that zero-section. Thus row
i is in the U of no other row. (But row i can itself be reduced by other rows during the algorithm.) Indeed,
whenever a row i has a non-zero in a column which is a unit vector, row i is never used by any other row.
Thus, if A contains an embedded identity matrix, the algorithm does not reduce A at all.

Given  that  the  slack  columns  do  not  participate  in  the  matchings,  it  is  more  efficient  to  create  a 
single  “phantom”  column  (column  0),  instead  of  many  slack  columns.  In  using  warm-start  matching,  all 
the  inequality  rows  can  be  permanently  matched  to  column  0  before  finding  M.  If  column  0  is  artificially 
included  among  the  columns  of  every  zero-section,  all  the  inequality  rows  initially  match  to  column  0  in 
the  zero-section  since  M  is  used  for  an  initial  matching.  Since  column  0  contains  no  non-zeros,  it  is  never 
labelled  during  computation  of  a  matching  in  a  zero-section.  Therefore,  none  of  the  inequality  rows  is  ever 
labelled either, so that the inequality rows are automatically not used. Furthermore, since the inequality
rows  stay  matched  throughout  the  process,  this  strategy  effectively  reduces  the  size  of  the  matching  problem 
in  each  zero-section. 

More importantly, the inequality rows can be excluded from A in determining its rank at the beginning,
because inequality rows always have full rank and are never used anyway. Therefore, the square full row
rank submatrix $A_{RC}$ obtained from the initial LU-factorization excluding the inequality rows is suitable
for using the restricted column option, and C is even smaller. These strategies for treating inequalities are
called phantom slacks. By using phantom slacks numerical execution time can be even further reduced.

The  disadvantage  of  inequalities  is  that,  by  decreasing  the  number  of  rows  available  for  use,  they  lead  to 
smaller reductions of non-zeros. The extreme case is that when all constraints are inequalities, the algorithm
is not able to eliminate any non-zeros. A general rule of thumb for using SA and SSA is that the higher the
proportion  of  equality  constraints,  the  better. 

Finally, potential implementors are reminded that as the rows of A are transformed, the same
transformations must be applied to the right-hand side(s) b. However, any RANGES or BOUNDS (see Murtagh (1981),
Section 9.2) do not need to be changed.

3.5.  Computational  Results 

In  this  section  we  shall  first  describe  the  implementation  of  an  experimental  version  of  the  algorithms 
of Section 3.4. Some preliminary computational results from the implementation are then discussed.

3.5.1.  An  Experimental  Implementation  of  the  Algorithm 

The  experimental  implementation  of  SA  and  SSA  is  a  FORTRAN  program  called  SPARSER.  The  program 
reads  A  in  industry-standard  MPS  format  (see,  e.g.,  Murtagh  (1981),  Section  9.2),  processes  it  by  one  of 
several  variant  algorithms  (depending  on  its  input  parameters),  and  outputs  the  reduced  A  in  MPS  format  if 
desired. The MPS input routine is a modified version of the routine used by MINOS (see Saunders (1977)),
which uses Brent’s (1973) version of double hashing to reduce time spent in row look-up.

The two biggest tasks for SPARSER are computing maximum cardinality matchings in various
submatrices of A, and computing the LU-factors of various rectangular submatrices of A (and solving the
resulting square subsystems). The matching is performed by a modified version of a depth-first search
look-ahead technique, as described in Gustavson (1976). Though this algorithm has poorer worst-case
performance than the Hopcroft and Karp (1973) algorithm, creators of sparse matrix software have empirically
observed that its average performance on typical real problems is better than that of other algorithms. The
lack of a realistic random model of sparse matrices has prevented any attempt to provide theoretical support for
this observation.

As mentioned in Section 3.4, the numerical tasks in SPARSER are handled by MA28, a package of sparse
matrix LU-factorization and linear equation-solving routines written by Duff at Harwell (see Duff (1977)).
The advantage of using MA28 for SPARSER is that it can process rectangular matrices. This feature is
important because the only practical way to choose column subset G from $A_{U\cdot}$ (or from $A_{UC}$ if using the
restricted column option of Section 3.4) is to factor all of $A_{U\cdot}$, which lets G be dynamically specified by
the choice of pivot columns. MA28 uses a hybrid method for choosing its pivots based on a pivot stability
parameter U which controls the trade-off between stability and sparsity. Setting U = 1.0 makes MA28 choose
pivots purely on the basis of stability; choosing U = 0.0 makes it choose pivots purely on the Markowitz
sparsity criterion (see Markowitz (1957)). The value U = 0.1 is recommended in Duff (1977) and was used
in all tests reported herein. The value of U can be set by an input to SPARSER.
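The trade-off governed by U can be illustrated with a generic threshold-pivoting rule combined with the Markowitz count. The sketch below is not MA28's code and makes several assumptions: the function name is hypothetical, the matrix is a small dense list of lists, a candidate is accepted when its magnitude is at least U times the largest magnitude in its column (comparing within the column rather than the row is a choice made for this sketch), and ties in the Markowitz count are broken in favour of the larger entry.

```python
def choose_pivot(matrix, u):
    """Pick a pivot (row, col) by Markowitz count among entries passing a
    relative threshold test with parameter u in [0, 1]."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_counts = [sum(1 for x in row if x != 0) for row in matrix]
    col_counts = [sum(1 for i in range(n_rows) if matrix[i][j] != 0)
                  for j in range(n_cols)]
    col_max = [max(abs(matrix[i][j]) for i in range(n_rows))
               for j in range(n_cols)]
    best_key, best_pos = None, None
    for i in range(n_rows):
        for j in range(n_cols):
            a = matrix[i][j]
            if a == 0 or abs(a) < u * col_max[j]:
                continue                      # fails the stability test
            markowitz = (row_counts[i] - 1) * (col_counts[j] - 1)
            key = (markowitz, -abs(a))        # sparsity first, then size
            if best_key is None or key < best_key:
                best_key, best_pos = key, (i, j)
    return best_pos


M = [[1, 0, 0],
     [10, 2, 3],
     [10, 4, 5]]

sparsity_pivot = choose_pivot(M, 0.0)   # every non-zero is acceptable
stability_pivot = choose_pivot(M, 1.0)  # only column maxima are acceptable
```

With U = 0.0 the small entry in the sparse first row is chosen because it causes no fill-in; with U = 1.0 it is rejected as too small relative to its column, and a larger entry is taken at the cost of a higher Markowitz count.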

MA28 has other parameters of interest to designers of SP algorithms. When MA28 parameter LBLOCK is
.TRUE., MA28 block-triangularizes its input (finds the DM decomposition of a square submatrix) before it is
factored. When the chosen submatrix decomposes into relatively small blocks, time can be saved since it is
cheaper to factor many small matrices rather than one big one. However, decomposing a submatrix can be
dangerous when the matrix is rectangular.

For example, within SPARSER, the matrix $A_{U\cdot}$ (or $A_{UC}$) (which is known to have rank |U|) is input to
MA28. When MA28 performs a block-triangularization, it first finds a maximum matching M in the matrix,
which must also be of size |U|. The difficulty is that MA28 forces itself to factor the size |U| square submatrix
B induced by M, and it can happen that rank B < |U|. This possibility once again illustrates the reason
why SA is preferred over PA: real matrices do not always have full numerical rank when they have a perfect
matching. A retry function within SPARSER overcomes this difficulty by deleting the column
that MA28 indicates is dependent and re-factoring, but doing this slows SPARSER down. Thus, although
SPARSER includes a way to change LBLOCK, it is strongly recommended that block-triangularization be
disabled.

Another MA28 parameter of interest controls the solution routine and is called MTYPE. If the input matrix
is B and the right-hand side is b, then MTYPE controls whether Bx = b or xB = b is the system to be solved.
Since SPARSER needs the solution of $\lambda_U A_{UG} = -A_{iG}$, it would seem that MTYPE should be defined so that
the second option is always taken. However, SPARSER subsumes MTYPE into a parameter of its own that
controls which one of $A_{U\cdot}$ or its transpose is input to the factor routine, and that selects the appropriate value of
MTYPE accordingly. This option allows experimentation to determine whether it is faster to factor rectangular
matrices with the smaller or the larger dimension first within SPARSER.

Two parameters particular to SPARSER are relevant here. The first specifies which algorithm is used
to process A. When describing algorithms, “combinatorial” means that an algorithm is applied formally
to the sparsity pattern of A, without performing any numerical operations. A combinatorial algorithm is
of use only as a fast way to determine the performance of an algorithm on a given matrix; since only the
sparsity pattern of the reduced matrix is correct, not the numerical values, the reduced matrix cannot serve
any further useful purpose. By contrast, “numerical” means that numerical operations are performed, so
that the reduced A is equivalent to the input A.

With this understanding of terms, SPARSER allows four algorithmic options: combinatorial PA,
combinatorial SA, numerical SSA, and numerical SA. The first two options are mainly used to verify the
correctness of SPARSER as it evolves; by Theorem 3.3.4 they should always give the same final number of
non-zeros. They are also useful for quickly checking a new matrix, as noted above. The third alternative is
SSA as described in Section 3.4, with the guarantee of Theorem 3.4.2. The fourth alternative is SA without
the phantom non-zeros safeguard of SSA that allows Theorem 3.4.2 to be applied.

All versions of the algorithm use warm-start matching and phantom slacks as described in Section 3.4,
but the restricted column option is user-selectable. The trade-off between decreased stability and
smaller execution time with the restricted column option can then be tested.

Four parameters govern how A is processed by SPARSER, each with two values: SA can be run with
safeguarding or not, with block-triangularization or not, factoring A or A^T, and using the restricted
column option or not. Thus there are 16 variant algorithms available through SPARSER.
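
The count of 16 is simply 2^4. As a small sketch, the variants can be enumerated as the Cartesian product of the four two-valued switches. The option labels below are illustrative names, not SPARSER's actual parameter names.

```python
from itertools import product

# Each of the four switches described above takes two values.
# The keys are hypothetical labels, not SPARSER's real parameter names.
OPTIONS = {
    "safeguarding": (True, False),
    "block_triangularization": (True, False),
    "factor_transpose": (False, True),
    "restricted_columns": (True, False),
}

def variants():
    """Return one dict per combination of the four two-valued switches."""
    names = list(OPTIONS)
    return [dict(zip(names, combo)) for combo in product(*OPTIONS.values())]

all_variants = variants()

# One particular setting: safeguarded, no block-triangularization,
# normal (untransposed) factoring, restricted columns.
example_setting = {
    "safeguarding": True,
    "block_triangularization": False,
    "factor_transpose": False,
    "restricted_columns": True,
}

print(len(all_variants))
print(example_setting in all_variants)
```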

3.5.2.  Testing  the  Variations  of  SPARSER

Each of the 16 variations of SPARSER was used to process a linear programming matrix called BANDM.
These runs held all other minor parameters fixed, and in particular set all print flags to their lowest level and
disabled MPS output of the reduced matrix, to allow the execution time to reflect only the basic processing.
The matrix of BANDM has m = 305, n = 472 and contains 2494 non-zeros, a density of 1.73%. It has 305
equality rows, 100% of the total. The combinatorial number of non-zeros that can be eliminated from BANDM
(the guarantee of Theorem 3.4.2) is 633, which is 25.4% of the total non-zeros (an impressive figure).

The results of these runs are displayed in Table 3.5.1. All testing was performed while running SPARSER
interactively on VM on an IBM 3081 at the Stanford Linear Accelerator Center. The “total time” is in CPU
seconds. The “total gain” column adds the lucky hits and
lucky non-fill-ins (each one a proof that (MP) is not satisfied) to the guaranteed combinatorial gain of 633
non-zeros. Column “max grow” gives the maximum value over all rows of an MA28 output parameter called
GROW. The value of GROW estimates the extent to which the numerical operations on a row cause the entries
of A to “blow up” numerically, and so provides an indirect measure of the stability of the reduced matrix
relative to the original matrix (see Duff (1977), pp. 17-18). Column “total used” gives Σ_i |U_i|, an indication
of the sizes of the linear systems that were solved within SPARSER, and “max used” gives max_i |U_i|, which
helps to indicate whether most of the gain is coming from a few rows (“max used” nearly as big as “total
used”) or is spread out (“max used” very small). Finally, “no. rematchings” reports how many times out of
the 305 equality rows an entry of the fixed matching was hit, necessitating repair.

Starting from the less important conclusions that can be drawn from Table 3.5.1, there seems to be
little significant difference among the 16 variations of the algorithm on any of the last five columns, with two
possible exceptions. First, “max grow” varies from 1.55 to 10^4, which might indicate that some variations
are inherently more stable than others. However, our experience has been that a value of 10^4 for “max grow”
is no cause for concern, and we conjecture that the lower values of “max grow” resulted more from luck than
from any fundamental differences among variations. Second, there appears to be a slight trend for the number of
rematchings to increase when using block-triangularization. However, this possible deficiency pales beside
the other flaws of block-triangularization, as shown by the results below.

There  seems  to  be  only  random  variation  in  total  gain  (from  641  to  649),  the  stated  objective.  A  pattern 
might  possibly  emerge  if  the  16  variations  were  run  on  a  different  matrix  for  which  lucky  gains  were  a  bigger 
part  of  total  gain  than  for  BANDM.  It  is  shown  later  that  BANDM  has  an  atypically  small  amount  of  lucky  gain. 
This issue will be tested further in the future. The only remaining criterion by which to judge the variations
is their execution times. There is a definite spread of times; the longest time is 3.32 seconds, almost twice
the shortest, 1.75 seconds.

Safeguarding or not makes very little difference in time according to Table 3.5.1, though again this might
change with a matrix with more lucky gain.

Factoring submatrices of A in their normal (as opposed to transposed) form appears to be faster for this
application, despite the opposite advice of the author of MA28. That is, factoring submatrices of BANDM with
the smaller (row) dimension first was faster, though Duff (1977, p. 28) suggests putting the larger dimension
first. The apparent exception to this observation occurs without the restricted column option and with
block-triangularization, in which case factoring the transpose is faster (see Table 3.5.1). This anomaly seems to be
due to the fact that in both of the cases when submatrices were normally factored, block-triangularization
caused difficulties by selecting singular matrices four times. The retry routine then caused four extra linear
systems to be solved, tipping the time balance to factoring the transpose, which had no bad luck with
block-triangularization.

Besides its other flaws mentioned above, using block-triangularization on BANDM makes SPARSER run
more slowly. This result is initially surprising, since block-triangularization is supposed to speed up solving
equations. An explanation for this behavior can be found in the “total used” column of Table 3.5.1. Since
BANDM has 305 rows, and “total used” is roughly 755 rows, an average of less than 2.5 other rows are
used per row of BANDM. Thus the linear systems passed to MA28 are already quite small, so that the
block-triangularization code adds an overhead greater than any possible savings. Even the supposed virtue of
block-triangularization is a flaw; its use in SPARSER is therefore discouraged even more strongly than before.

Finally, using the restricted column option leads to a big decrease in time (which is scarcely any surprise).
Without it, MA28 is passed a matrix whose largest dimension is n = 472. Using the restricted column option,
the largest dimension drops to the rank of the equality rows, which is 305 for BANDM, a decrease
of 35.4%. Since we suspect that numerical operations dominate SPARSER’s processing time, such a large
decrease is bound to have a big effect on running time.

With these results in mind, the rest of the tests were conducted with no block-triangularization, factoring
A in normal form, using the restricted column option, and using safeguarding. The last choice was made
only because it is better to be safe than sorry.

3.5.3.  Testing  SPARSER  on  Real  Matrices 

The next objective in testing SPARSER was to apply it to a variety of real problems to see how well
it does in practice. We obtained 23 linear programming problems from Harlan Crowder of the IBM T. J.
Watson Research Laboratory in order to investigate the performance of SPARSER. They were selected solely
on the basis of having a high proportion of equality constraints, and range in size from AFIRO (which is
27 × 32 with 83 non-zeros) to AIRFSTAR (which is 311 × 3637 with 10,513 non-zeros).

The results of running SPARSER on these 23 problems are displayed in Table 3.5.2. The “rows” and
“columns” figures quoted include only those relevant to SPARSER (i.e., objective rows and right-hand
sides are excluded), and “non-zeros” counts only those non-zeros in relevant rows and columns. The “%
eq. rows” column reports what percentage of the relevant rows are equalities; recall from Section 3.4 that this
figure is potentially important in determining the performance of SPARSER. Three of the matrices have
rank-deficiencies that effectively reduce the percent of equality rows in Table 3.5.2 (BOAC from 41.2 to 36.8;
CRACPB1 from 62.2 to 62.0; BRANDY from 75.5 to 72.0); the higher figure is used in Table 3.5.2 since it can be

Name        # of   # of     # of   % Eq.   Orig.  % Red.  Lucky/   Total    Max.
            Rows  Cols.  Non-Zs.    Rows  % Den.  Non-Z.   Comb.    Time    Grow
--------------------------------------------------------------------------------
AFIRO         27     32       83    29.6    9.61    0.00             .08
BOAC         313    298     1659    41.2    1.78    0.00            1.97       -
PH5SOS2      271   1426     2648    82.7    0.69    0.00            1.15       -
WEYERHSR      41    107      706    97.6   16.09     .85    0.20     .23    2.82
STOItKSOS     68    590                     6.94    2.44     .58     .84     829
ADLITTLE      56     97      383    26.8    7.05    3.39     .08     .23      45
FRK.SNO      208    316     1791            2.72    3.46     .03     .55    10.6
L84MAV       113   1995     9126    99.1    4.05    4.95    3.07    5.62     0.0
================================================================================
L152D         93   1550     9862    98.9    6.84    7.16    6.76    2.28     0.0
L21LAV       108   1939     8779    99.1    4.19    7.34    5.08    2.32     1.0
CAPRI        271    353     1767    52.4    1.85    8.09     .01     .89    6.13
CRACPB1      143    572     4158    62.2    5.08    8.37    2.87    1.77     0.0
SHARE2B       96     79      694    13.5    9.15    8.50     .23     .25      65
L27LAV       146   2655   11,203    99.3    2.80    8.84    5.78    3.99     1.0
E226         223    282     2578    14.8    4.10    9.00    1.97     .58     1.0
SHARE1B      117    225     1151    75.4    4.37   13.47     .12     .55    1810
BRANDY       220    249     2148    75.5    3.92   14.15     .30     .95    45.5
AIRFSTAR     311   3637   10,513    99.0     .93   14.61   21.26    3.35     1.0
L152LAV       97   1989     9922    99.0    5.14   16.11   15.82    2.46     1.0
LP4L          85   1086     4677    98.8    5.07   17.96    9.12    1.50     1.0
L94MAV        93   1750     7294    98.9    4.48   21.02   15.85    5.49     0.0
BANDM        305    472     2494   100.0    1.73   25.78     .02    2.04     194
BEACONFD     173    262     3375    80.9    7.45   64.06     .80    1.05     136

Table 3.5.2


determined without running SPARSER. The “orig. density” is the density of the original matrix, based on
relevant non-zeros, relevant rows and relevant columns.

The linear programs in Table 3.5.2 are listed in increasing order of “% red. in non-zeros,” which is
the total reduction in non-zeros achieved by SPARSER (both combinatorial and lucky) as a percentage of
relevant non-zeros. It ranges from zero for the first three matrices to an astounding 64.06% reduction for
BEACONFD. The “lucky/comb.” column lists the ratio of lucky gains to combinatorial gains, that is, their
relative contribution to the total gain. For example, BANDM had a combinatorial gain of 633 non-zeros and
a lucky gain of 10 non-zeros, for a ratio of 10/633 ≈ .02, the second-smallest value listed. The “total time”
and “max grow” columns have the same meaning as in Table 3.5.1. The discrepancy in time for BANDM
between Tables 3.5.1 and 3.5.2 is due to the fact that the two sets of runs which are summarized in the
tables were done several months apart. Thus times are comparable within tables but not between tables.
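
As a quick check on how these two columns are formed, the BANDM entries of Table 3.5.2 can be recomputed from the figures quoted in the text. This is only an arithmetic sketch; the variable names are ours.

```python
# Figures for BANDM quoted in the text and in Table 3.5.2.
nonzeros = 2494            # relevant non-zeros in the original matrix
combinatorial_gain = 633   # guaranteed by Theorem 3.4.2
lucky_gain = 10            # extra non-zeros eliminated by numerical luck

total_gain = combinatorial_gain + lucky_gain
pct_reduction = 100.0 * total_gain / nonzeros
lucky_ratio = lucky_gain / combinatorial_gain

print(f"% reduction in non-zeros: {pct_reduction:.2f}")
print(f"lucky/combinatorial:      {lucky_ratio:.2f}")
```

Both values agree with the BANDM row of the table (25.78 and .02).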

The runs summarized in Table 3.5.2 produced other interesting statistics. First, the total number of
rematchings over all 23 linear programs was only 11, with BEACONFD alone accounting for 23. Thus, having
to repair the fixed matching in warm-start matching does not result in large additional overhead. On a
related note, when the bound on the combinatorial running time of SPARSER with warm-start matching
was computed in Section 3.4, a key fact that was used is that the number of unmatched rows (after copying
the fixed matching) is bounded by the number of non-zeros. In practice, the number of unmatched rows
averaged  out  as  less  than  1 2%>  of  the  non-zeros. 

The total used for each linear program reveals that the average |U_i| for the 23 problems was only 1.06.
Comparing the “combinatorial gain” and “total used” figures shows that each used row leads to a gain of
.92 non-zeros on average. Recall that, by Theorem 3.2.2, each used row can make a combinatorial gain of at
most one non-zero. Although saved and fill-in columns must be taken into account to derive the algorithm,
in these runs they are rare events.

Many of the “lucky/combinatorial” figures may seem surprisingly high, but examining “max used” versus
“total used” reveals the explanation. For seven of the eight linear programs with “lucky/combinatorial” > 3
(all except for AIRFSTAR, where “max used” is 4), “max used” is equal or nearly equal to “total used”. This
indicates that almost all of the total gain is achieved at one row. All seven of these lucky linear programs
(those whose names start with L) have one nearly dense row. The dense row is the only one with a non-empty
U_i, and the matrices are structured so that cancelling out some of the non-zeros in the dense row also luckily
cancels out nearly all the rest.

A  high  value  of  “lucky/combinatorial”  may  be  caused  by  other  special  structures  as  well.  When  using 
SPARSER  in  practice,  a  high  value  of  “lucky/combinatorial”  could  indicate  that  a  more  specialized  method 
might  be  more  appropriate  than  SPARSER.  Despite  having  no  provision  to  exploit  such  structure,  SPARSER 
achieved  creditable  reductions  on  the  lucky  matrices  anyway. 

The results in Table 3.5.2 show that SPARSER can significantly reduce many matrices. The degree of
reduction does not appear to be predictable from percentages of equalities or of density. The running time of
SPARSER seems to be quite modest. Its large value on the lucky matrices is dominated by the time needed
to determine the rank of the equality rows. Since the lucky matrices have an unusually large number of
columns relative to rows, MA28 is required to factor an apparently huge matrix. Choosing to factor the
transpose of such matrices might prove to be faster. The “max grow” in the reduced matrices was quite
small for most of the linear programs.

3.5.4.  Optimizing  Reduced  Matrices 

Encouraged by the success of SPARSER in reducing this set of matrices, we tested the 15 problems with
reductions of at least 5% (the ones below the heavy line in Table 3.5.2) in comparative optimization runs.
The optimization program MINOS (see Saunders (1977)) was chosen for these test runs. MINOS is a
high-quality transportable FORTRAN routine for solving problems of the type (3.1.1). It uses state-of-the-art
sparse matrix techniques and is in daily use on the SLAC computer for solving a large energy model linear
program (see Dantzig et al. (1981)). Rather than having SPARSER pass an internal representation of the
reduced matrix to MINOS, SPARSER output an MPS file of the reduced matrix which was then used as the
input for MINOS.

It is not easy to compare the time used by MINOS to optimize an original versus a reduced problem. The
Simplex Algorithm follows the same pivot sequence on both A and TA if T is non-singular, except possibly
in Phase I. The reason is that when Phase I adds artificial variables to A and TA, it obtains (A I) and
(TA I), which are no longer equivalent. Thus Phase I can follow a different pivot sequence on the reduced
problem than on the original problem, which would result in different initial feasible bases for Phase II. The
overall result is a different pivot sequence and a different number of iterations before optimality, making
comparison difficult.
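
The loss of equivalence can be seen on a tiny example: applying T to the augmented original gives T(A I) = (TA T), which differs from (TA I) unless T = I. The sketch below uses a hypothetical 2 × 3 matrix A and a hypothetical non-singular T chosen only for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def with_artificials(M):
    """Append an identity block to M, giving (M I)."""
    m = len(M)
    return [row + [1 if i == j else 0 for j in range(m)] for i, row in enumerate(M)]

# Hypothetical data, for illustration only.
A = [[1, 2, 0],
     [1, 0, 3]]
T = [[1, 0],
     [-1, 1]]   # non-singular: subtracts row 1 from row 2

TA = matmul(T, A)
phase1_reduced = with_artificials(TA)                    # (TA I), built by Phase I
transformed_original = matmul(T, with_artificials(A))    # T (A I) = (TA T)

print(phase1_reduced)
print(transformed_original)
```

The structural columns agree, but the artificial columns are I in one case and T in the other, so the two Phase I problems are not row-equivalent.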

We  have  attempted  to  circumvent  this  problem  by  starting  both  the  original  and  reduced  problems 
with  the  same  feasible  basis.  This  basis  is  obtained  by  running  MINOS  on  the  original  problem  until  the 
first  feasible  basis  is  obtained.  Then,  in  theory  at  least,  both  the  original  and  reduced  problems  follow  the 
same  pivot  path  to  optimality,  so  that  any  time  dilfercnces  can  be  attributed  solely  to  increased  sparsity.  A 
drawback  to  this  approach  is  that  there  are  fewer  iterations  over  which  the  cost  of  running  SPARSER  can 
be  amortized. 

Also,  it  is  important  to  know  whether  reduced  problems  have  any  bias  towards  taking  more  or  fewer 
iterations  than  original  problems.  Before  we  became  aware  of  the  Phase  I  difficulty  discussed  above,  several 
pairs  of  original  and  reduced  problems  were  optimized  starting  from  a  (non-equivalent)  crash  basis.  No 
consistent  bias  in  iterations  was  observed,  but  more  formally  organized  experiments  arc  needed  to  determine 
whether  this  is  holds  in  general. 

The results of the comparative MINOS runs are summarized in Table 3.5.3. The “% redn.” column is
copied from Table 3.5.2. The “orig. time” and “reduc. time” columns give the total time, in seconds, for

Name          % Redn.      Orig.     Reduc.    % Redn.    % Redn.       Itn.
              in Non-       Time       Time    in Time    in Time      Ratio
                Zeros     (sec.)     (sec.)    (corr.)  (uncorr.)   (or/red)
----------------------------------------------------------------------------
L152D            7.16       8.42        7.8       3.98       5.06       1.04
L21LAV           7.34       9.81       9.38       4.18       4.94       1.00
CAPRI            8.09       2.04       2.00       1.25       2.19       1.00
CRACPB1          8.37       2.70       1.93      -7.39      -4.69       1.62
SHARE2B          8.50        .55        .53       8.10       9.57        .95
L27LAV           8.84      12.72      12.60        .10       1.07       1.00
E226             9.00       7.04       6.80       3.20       3.57       1.00
SHARE1B         13.47       1.71       1.61       5.21       6.36       1.00
BRANDY          14.15       2.13       1.98       6.16       7.92       1.00
AIRFSTAR        14.61       5.83       5.90       6.87      10.50        .88
L152LAV         16.11       7.95       6.84        .19       3.20       1.16
LP4L            17.96       3.80       3.32       5.94       8.84       1.07
L94MAV          21.02       6.36       6.28       9.17      11.65        .90
BANDM           25.78       4.31       3.68      13.87      15.76       1.00
BEACONFD        64.06        .90        .63       9.42      49.17       1.00

Table 3.5.3

MINOS  to  solve  the  original  and  reduced  problems  respectively,  starting  from  the  same  initial  feasible  basis, 
with  MINOS  running  as  a  batch  job  on  VM. 

The  last  column  of  Table  3.5.3  gives  the  number  of  iterations  for  the  original  problem  divided  by  the 
number  of  iterations  for  the  reduced  problem.  It  shows  that  starting  the  original  and  reduced  problems  at 
the  same  feasible  basis  does  not  always  produce  the  same  number  of  iterations  in  practice;  where  they  differ, 
there  is  no  discernible  bias  favoring  either  the  original  or  reduced  problem. 

Columns 5 and 6 of Table 3.5.3 give estimated percent reductions in the time required to bring the
starting feasible basis to optimality. An estimate (derived from SPARSER) of time spent inputting the MPS
file was subtracted from the times reported in columns 3 and 4, in order to make the comparison more nearly
reflect actual differences in time per iteration (the actual time spent iterating in MINOS is not currently
available). The time spent in MPS input depends on the size of the linear program. Since a matrix reduced
by SPARSER can have considerably fewer non-zeros than the original matrix, inputting a reduced MPS file
can take less time than inputting the original MPS file. The MPS input time used to calculate the percent
reduction in column 5 has been “corrected” by the ratio of the number of reduced non-zeros to the number of
original non-zeros to try to account for this difference. The percent reduction in column 6 has not been so
corrected.

More formally, denote the total original optimization time from column 3 of Table 3.5.3 by OT, the
total reduced time from column 4 by RT, the MPS input time from SPARSER by IT, and the iteration ratio
from column 7 by r. Then the value in columns 5 and 6 is

$$\frac{(OT - IT) - r\,(RT - f \cdot IT)}{OT - IT},$$

expressed as a percentage, where f is the ratio of reduced to original non-zeros for column 5, and is 1 for
column 6. We believe that the true percent reduction in time lies between the values in columns 5 and 6.
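
The formula is easy to evaluate directly. The sketch below applies it to the BANDM row of Table 3.5.3; the MPS input time IT is not reported in the tables, so the value 0.31 seconds used here is purely an assumed figure for illustration, and the function name is ours.

```python
def pct_time_reduction(ot, rt, it, r, f=1.0):
    """Estimated percent reduction in iterating time.

    ot: total original optimization time (column 3)
    rt: total reduced optimization time (column 4)
    it: MPS input time for the original problem
    r:  iteration ratio, original / reduced (column 7)
    f:  factor applied to the input time of the reduced problem
        (reduced / original non-zeros for column 5, 1 for column 6)
    """
    original_iterating = ot - it
    reduced_iterating = r * (rt - f * it)
    return 100.0 * (original_iterating - reduced_iterating) / original_iterating

# BANDM figures from Table 3.5.3; IT is an assumed value, not a reported one.
OT, RT, R = 4.31, 3.68, 1.00
IT_ASSUMED = 0.31
f_corrected = 1.0 - 25.78 / 100.0   # reduced / original non-zeros

uncorrected = pct_time_reduction(OT, RT, IT_ASSUMED, R)               # column 6 style
corrected = pct_time_reduction(OT, RT, IT_ASSUMED, R, f_corrected)    # column 5 style
print(f"corrected {corrected:.2f}, uncorrected {uncorrected:.2f}")
```

With any positive input time and f < 1, the corrected figure is the smaller of the two, which matches the ordering of columns 5 and 6 in the table.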

The results in Table 3.5.3 show that reducing the number of non-zeros in these problems does indeed
reduce the time needed to optimize them. Though the results have a large amount of variability, as a rule of
thumb it appears that the percent reduction in iteration time is about half the percent reduction in non-zeros.
In this light, these results are encouraging since they bear out the hope that problems with sparser matrices
can be optimized faster. Moreover, SPARSER is sufficiently effective at increasing sparsity that the CPU
time noticeably decreased.

But it is discouraging to compare the difference in original and reduced MINOS times in Table 3.5.3 with
the SPARSER times in Table 3.5.2. In no case is the SPARSER time smaller than the overall time saved in
MINOS. Pre-processing matrices to reduce optimization time is not helpful if the time saved is smaller than
the time spent in pre-processing.

There are several factors that convince us that our algorithm will eventually prove to be practically
useful in spite of its apparently unhelpful behavior in these experiments. One factor is that the linear
programs that were tested are relatively small and solve relatively quickly. Another factor is that because
of the way the experiments were set up, the reduction in time counts only the Phase II iterations, which
typically are about half of the total iterations from a cold start.

The time taken by the algorithm should grow more slowly with increasing problem size than the time
taken by an optimization routine. For example, Hillier and Lieberman (1971), p. 181, state that the solution
time of linear programs is usually O(m^3), whereas, as stated in Section 3.4, SPARSER uses only O(m^2) time.
Thus SPARSER would perform better on larger problems. Also, pre-processing with the algorithm is a fixed
cost that saves time at every iteration of an optimization, and hence would be of higher utility on more
difficult problems that take relatively more iterations. For example, the lucky linear programs are actually
integer programs. Solving integer programs typically involves solving the associated linear program many
times, in which case the algorithm would be more useful.
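
The fixed-cost argument can be made concrete with a break-even count: pre-processing pays for itself once the accumulated savings over repeated solves reach the pre-processing time. The sketch below uses the BANDM figures from Tables 3.5.2 and 3.5.3 purely as an illustration; since times are not comparable between the two tables, and BANDM is not itself solved repeatedly, the result is indicative only. The function name is ours.

```python
import math

def break_even_solves(preprocess_time, saving_per_solve):
    """Smallest number of repeated solves whose total saving covers the
    one-time pre-processing cost; None if nothing is saved per solve."""
    if saving_per_solve <= 0:
        return None
    return math.ceil(preprocess_time / saving_per_solve)

sparser_time = 2.04           # BANDM, Table 3.5.2 (seconds)
minos_saving = 4.31 - 3.68    # BANDM, original minus reduced time, Table 3.5.3

print(break_even_solves(sparser_time, minos_saving))
```

Under these illustrative figures a handful of re-solves would be enough, which is the situation described above for integer programs.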

The  current,  experimental  version  of  SPARSER  spends  some  time  debugging  itself  and  accumulating 
statistics  on  its  performance.  A  more  streamlined  implementation  would  be  faster.  Also,  outputting  a 
reduced  MPS  file  from  SPARSER  for  input  to  MINOS  is  inefficient.  A  more  realistic  implementation  would 
integrate  SPARSER  into  MINOS,  thereby  eliminating  the  unnecessary  file-handling  time. 

On  balance,  the  tests  show  that  increasing  sparsity  is  possible  and  that  it  reduces  the  time  needed  to 
solve a problem. It remains to be shown (by more extensive tests) whether our algorithm is a practical way
of  increasing  sparsity.  The  test  results  so  far  are  not  overwhelmingly  encouraging,  but  they  do  suggest  that 
a  streamlined  version  of  SPARSER  may  prove  useful. 

3.6.  Conclusions  and  Extensions 

In  the  preceding  sections  we  have  argued  that  the  Sparsity  Problem  is  important,  and  one  way  to  attack 
it  has  been  explored.  The  key  to  our  approach  is  that  by  assuming  the  Matching  Property,  the  One  Row 
Sparsity  Problem  can  be  solved  (as  shown  in  Section  3.2).  The  resulting  One  Row  Algorithm  is  at  the  heart 
of  the  subsequent  development. 

The Parallel and Sequential Algorithms of Section 3.3 are not much more than clever ways of applying
the  One  Row  Algorithm  to  each  row.  Proving  their  correctness  is  not  trivial,  but  some  interesting  theoretical 
results  are  obtained  in  return. 

Section 3.4 shows that additional effort is required to bring the Sequential Algorithm to a point where it
can  be  applied  to  real-life  matrices.  The  practical  algorithm  described  there  seems  to  work  reasonably  well 
judging  by  the  computational  results  in  Section  3.5,  but  different  and  possibly  better  ways  of  implementing 
the  algorithm  are  possible. 

One possibility is a two-pass algorithm that separates the combinatorial and numerical parts of the
algorithm, as follows. Theorem 3.3.1 implies that the parallel U_i induce an ordered decomposition, the SP
Decomposition, on the rows of A (see Section 3.3). The SP Decomposition can be found by applying PA
purely combinatorially to A and observing the sparsity pattern of T*.

The proof of Theorem 3.3.2 shows that the effect of T* on A is to transform “diagonal blocks” of A
(with respect to some matching) into diagonal submatrices, and “subdiagonal blocks” (with respect to the
same matching) into zero submatrices. The pattern of these diagonal and zero blocks is closely related to the
sparsity pattern of T (see the example in the proof of Theorem 3.3.2). What is happening is that the fixed
matching picks out a square submatrix B of A. The matrix T* is composed of various pieces of B^{-1}, the
particular pieces being determined by the SP Decomposition of A. Thus the reduced matrix T*A is the result
of performing only part of the Gauss-Jordan reduction of A that turns B into a diagonal matrix.

Since the SP Decomposition can be determined combinatorially, it can be known in advance which
part  of  the  Gauss-Jordan  reduction  of  A  to  perform.  Instead  of  letting  a  fixed,  combinatorially-chosen 
matching  determine  the  submatrix  B  at  the  start,  a  partial  Gauss-Jordan  elimination  could  be  performed 
with  dynamic,  numerically-controlled  pivot  choices.  The  pivot  choices  would  then  implicitly  determine  the 
submatrix  B,  but  only  after  the  numerical  operations.  Such  an  algorithm  would  clearly  be  theoretically 
optimal  since  it  eliminates  as  many  non-zeros  from  A  as  PA  does. 

More importantly, a two-pass algorithm is also practically implementable. Since its operations are driven
by  numerical  choices  rather  than  by  combinatorial  choices,  no  difficulties  would  be  encountered  in  applying 
it  to  matrices  without  (MP).  It  might  also  have  an  advantage  in  efficiency  over  SA. 

To see how such an advantage would arise, note that the goal of Gauss-Jordan elimination is to reduce
the partitioned matrix (B C) (where B is square and non-singular) to (I B^{-1}C). In the dense case,
Gauss-Jordan elimination takes O(n^3) operations. The Sequential Algorithm achieves the reduction
by solving for the multipliers that reduce each row separately, even though the multipliers are interrelated.
Since one such solve uses O(n^3) operations in the dense case, and there is one solve per row, this would give
an O(n^4) algorithm overall, compared with O(n^3) for a single Gauss-Jordan elimination.
Similar savings might occur in the sparse case as well.
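
For reference, the full dense reduction of (B C) to (I B^{-1}C) can be sketched in a few lines. This is an ordinary textbook Gauss-Jordan elimination with row pivoting in exact arithmetic, not the sparse, partial reduction proposed for the two-pass algorithm; the function name and the small test matrix are ours.

```python
from fractions import Fraction

def gauss_jordan(M, k):
    """Reduce the k-row matrix M = (B C), where B is the leading k x k block,
    to (I  B^{-1} C) using row operations with row pivoting."""
    M = [[Fraction(v) for v in row] for row in M]
    for col in range(k):
        # Numerically controlled pivot choice: largest entry in the column.
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        if M[piv][col] == 0:
            raise ValueError("leading block B is singular")
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        # Eliminate the pivot column from every other row.
        for r in range(k):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return M

# B = [[2, 1], [0, 3]] and C = [[4, 0], [3, 6]], stored side by side.
BC = [[2, 1, 4, 0],
      [0, 3, 3, 6]]
reduced = gauss_jordan(BC, 2)
print(reduced)
```

Each of the k pivot steps updates all rows at once, which is where the O(n^3) total comes from; solving separately for each row's multipliers repeats much of this work.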

There are two difficulties with the two-pass algorithm that have prevented its implementation and
comparison with SA. The first is that we prefer a more elegant way of computing the SP Decomposition
than running PA combinatorially. The whole point of the SP Decomposition is that the U_i are themselves
interrelated, and hence there should be a way to use the relationships to help compute the U_i. The
ideal algorithm would globally develop the SP Decomposition, rather than generating it row by row as
the combinatorial PA does. It would be esthetically pleasing if the combinatorial phase were to parallel the
block (as opposed to row by row) nature of the numerical phase. However, such an algorithm is not yet at
hand.

The  second  difficulty  is  that  a  practical  implementation  of  the  two-pass  algorithm  would  require  its 
own  numerical  subroutine  to  perform  the  partial  Gauss-Jordan  reduction.  The  existence  of  MA28  allowed 
implementation  of  SPARSER  in  a  relatively  short  period  of  time.  (Indeed,  a  private  communication  with  Duff 
reveals  that  SPARSER  is  the  only  application  of  the  capabilities  for  solving  rectangular  systems  in  MA28  of 
which  he  is  aware.)  Writing  such  a  piece  of  software  so  that  it  is  efficient,  takes  full  advantage  of  sparsity 
and  is  numerically  stable  is  a  momentous  undertaking. 

There  is  another  possible  strategy  which  may  improve  the  practicality  of  the  algorithm.  The  sparsity 
of  A  is  globally  improved  in  the  hope  that  on  average  the  bases  are  then  sparser.  The  objective  of  the 
Sparsity  Problem  implicitly  assumes  that  any  column  is  equally  likely  to  appear  in  a  basis.  In  some  situations 
there  is  a  priori  knowledge  about  which  columns  are  more  likely  to  appear  in  a  basis  than  other  columns. 
Past  experience  or  physical  considerations  of  a  model  might  lead  to  such  knowledge.  Alternatively,  in  an 
optimization  with  many  iterations,  the  frequencies  with  which  each  column  appears  in  a  basis  could  be 
recorded  in  order  to  apply  an  algorithm  that  can  take  advantage  of  such  information  in  the  midst  of  the 
optimization.  Such  a  strategy  would  allow  the  sparsity  of  A  to  be  dynamically  adjusted  to  reflect  information 
about the columns that are most active during a long optimization run.

In  either  case,  the  problem  of  interest  is  the  Weighted  Sparsity  Problem  (WSP).  The  WSP  is  the 
same  as  the  regular  Sparsity  Problem  except  each  column  has  a  weight  which  represents  an  estimate  of 
how likely that column is to be in a basis. The objective of WSP is to transform A into an Â which has a
minimum  weighted  number  of  non-zeros.  WSP  is  another  area  for  future  research. 
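
As an illustration of the WSP objective, the sketch below estimates column weights from a record of the bases seen so far and evaluates the weighted number of non-zeros of a matrix. The names column_weights and weighted_nonzeros are hypothetical, and taking the weight of a column to be the fraction of recorded bases that contain it is only one plausible assumption; nothing above prescribes how the weights are to be estimated.

```python
from collections import Counter


def column_weights(bases, n_cols):
    """Estimate, for each column, how likely it is to appear in a basis.

    bases  -- list of bases, each a collection of column indices
    n_cols -- total number of columns of A
    Assumption: weight = fraction of recorded bases containing the column.
    """
    if not bases:
        # No information yet: treat every column as equally likely,
        # which recovers the unweighted Sparsity Problem.
        return [1.0] * n_cols
    counts = Counter(j for basis in bases for j in set(basis))
    return [counts[j] / len(bases) for j in range(n_cols)]


def weighted_nonzeros(matrix, weights):
    """Sum the column weight of every non-zero entry of the matrix."""
    return sum(w for row in matrix for x, w in zip(row, weights) if x != 0)
```

For example, with the recorded bases {0, 2}, {0, 1}, {0, 2}, {0, 2} on three columns the weights are 1.0, 0.25 and 0.75. The matrix with rows (1 1 0) and (1 1 1) then has weighted count 3.25, while the equivalent matrix with rows (1 1 0) and (0 0 1), obtained by subtracting the first row from the second, has weighted count 2.0. With all weights equal to 1 the function returns the ordinary number of non-zeros.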

As  mentioned  at  the  end  of  Section  3.5,  the  computational  results  are  only  indicative,  not  conclusive. 
It  would  be  very  interesting  to  assemble  a  large  collection  of  typical  large-scale  problems  with  which  to  test 
SPARSER,  and  to  address  the  following  questions.  Are  the  reductions  achieved  by  SPARSER  in  Table  3.5.2 
typical? Is there any association between the form of an optimization problem and SPARSER’s performance
on it? (None appears obvious from Table 3.5.2, but perhaps there is a link with the source of the model or
form of the matrix.) Are the conclusions drawn from Table 3.5.1 still valid on other matrices, particularly
those with a higher ratio of “lucky gain” to “combinatorial gain?” Is there a stronger association between
reduction  in  non-zeros  and  speed-up  in  optimization  time  than  is  evident  in  Table  3.5.3?  Finally,  and  most 


Bibliography 


Bixby, R. and W. Cunningham (1983). Converting linear programs to network problems, to appear in Math.
of OR.

Bollobás, B. (1978). Extremal Graph Theory, Academic Press, London and New York.

Bondy, J. A. and U. S. R. Murty (1976). Graph Theory with Applications, MacMillan, London.

Brent, R. P. (1973). Reducing the retrieval time of scatter storage techniques, Comm. ACM, 16, pp. 105-109.

Coleman, T. F. and J. J. Moré (1981). Estimation of sparse Jacobian matrices and graph coloring problems,
Report ANL-81-39, Argonne National Laboratory, Argonne, IL.

Coleman, T. F. and J. J. Moré (1982). Estimation of sparse Hessian matrices and graph coloring problems,
Cornell University Department of Computer Science Report TR 82-535, Ithaca, NY.

Cottle, R. W. (1974). Manifestations of the Schur Complement, Linear Algebra and its Applications, 8, pp.
182-211.

Curtis,  A.  R.,  M.  J.  D.  Powell  and  J.  K.  Reid  (1974).  On  the  estimation  of  sparse  Jacobian  matrices,  Journal 
of  the  Institute  of  Mathematics  and  its  Applications,  13,  pp.  117-119. 

Dahlquist, G. and A. Björck (1974). Numerical Methods, Prentice-Hall, Englewood Cliffs, NJ.

Dantzig, G. B. (1963). Linear Programming and Extensions, Princeton University Press, Princeton, NJ.

Dantzig, G. B., B. Avi-Itzhak, T. J. Connolly, W. D. Winkler, et al. (1981). PILOT-1980 energy-economic
model, volume 1, Electric Power Research Institute, Report EA-2090, Stanford, CA.

Dencker, P., K. Dürre and J. Heuft (1981). Optimization of parser tables for portable compilers, Universität
Karlsruhe Interner Bericht Nr. 22/81, Karlsruhe.

Duff, I. S. (1977). MA28 - a set of FORTRAN subroutines for sparse unsymmetric linear equations, A.E.R.E.
Harwell Report 8730.

Dulmage,  A.  L.  and  N.  S.  Mendelsohn  (1963).  Two  algorithms  for  bipartite  graphs,  J.  SIAM,  11,  pp.  183-194. 

Dürre, K. and G. Fels (1980). Efficiency of sparse matrix storage techniques, in Discrete Structures and
Algorithms (U. Pape, ed.), München-Wien, pp. 209-221.

Ford, L. R. and D. R. Fulkerson (1962). Flows in Networks, Princeton University Press, Princeton, NJ.

Garey, M. R. and D. S. Johnson (1979). Computers and Intractability, Freeman, San Francisco, CA.

George, J. A. and F. G. Gustavson (1980). A new proof on permuting to block triangular form, IBM RC
Report 8238, Yorktown Heights, NY.

Gill, P. E., W. Murray and M. H. Wright (1981). Practical Optimization, Academic Press, London and New
York.

Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs, Academic Press, London and New
York.

Grimmett, G. R. and C. J. H. McDiarmid (1975). On colouring random graphs, Proceedings of the Cambridge
Philosophical Society, 77, pp. 313-324.

Gustavson, F. G. (1973). Permuting matrices stored in sparse format, Disclosure Number 8-72-001, IBM
Technical Disclosure Bulletin, 16, 1, pp. 357-359.

Gustavson, F. G. (1976). Finding the block triangular form of a sparse matrix, in Sparse Matrix Computations,
Academic Press, New York.

Hausmann, D. and B. Korte (1981). Algorithmic versus axiomatic definitions of matroids, Math. Prog. Study,
14, pp. 98-111.

Hillier, F. S. and G. J. Lieberman (1974). Operations Research, second edition, Holden-Day, San Francisco,
CA.

Hoffman, A. J. (1982). Personal communication.

Hoffman,  A.  J.  and  S.  T.  McCormick  (1982).  A  fast  algorithm  that  makes  matrices  optimally  sparse,  Stanford 
University  Systems  Optimization  Laboratory  Report  82-13,  Stanford,  CA  (a  revised  version  is  to  appear 
in  the  Proceedings  of  the  Silver  Jubilee  Conference  on  Combinatorics,  University  of  Waterloo,  Waterloo, 
Ontario,  (1983)). 

Hopcroft, J. E. and R. M. Karp (1973). An n^{5/2} algorithm for maximum matching in graphs, SIAM J.
Computing, 2, 4, pp. 225-231.


Iri, M. (1983). Structural theory for the combinatorial systems characterized by submodular functions, to
appear in the Proceedings of the Silver Jubilee Conference on Combinatorics, University of Waterloo,
Waterloo, Ontario.

Itai, A. and M. Rodeh (1978). Finding a minimum circuit in a graph, SIAM J. Computing, 7, 4, pp. 413-423.

Johnson, D. S. (1974). Worst case behavior of graph coloring algorithms, in Proceedings of the 5th Southeastern
Conference on Combinatorics, Graph Theory, and Computing, Utilitas Mathematica, Winnipeg, Manitoba,
pp. 513-527.

Karp, R. M. (1972). Reducibility among combinatorial problems, in Complexity of Computer Computations
(R. E. Miller and J. W. Thatcher, eds.), Plenum, New York, pp. 85-103.

Kaneko, I., M. Lawo and G. Thierauf (1982). On computational procedures for the force method, Int. J. for
Num. Meth. in Engr., 18, pp. 1469-1495.

Knuth, D. E. (1973). The Art of Computer Programming, Volume 1, Second Edition, Addison-Wesley, Menlo
Park, CA.

Lawler, E. L. (1976). Combinatorial Optimization, Holt, Rinehart and Winston, New York.

Markowitz,  H.  M.  (1957).  The  elimination  form  of  the  inverse  and  its  application  to  linear  programming, 
Management  Science,  3,  pp.  255-269. 

McCormick,  S.  T.  (1981).  Optimal  approximation  of  sparse  Hessians  and  its  equivalence  to  a  graph  coloring 
problem,  Stanford  University  Systems  Optimization  Laboratory  Report  SOL  81-22,  Stanford,  CA  (a 
revised  version  appears  in  Math.  Prog.,  26  (1983),  pp.  153-171). 

McCormick,  S.  T.  (1983).  A  combinatorial  approach  to  some  sparse  matrix  problems,  Ph.  D.  Thesis,  Stanford 
University,  Stanford,  CA. 

Murtagh, B. A. (1981). Advanced Linear Programming and Practice, McGraw-Hill, New York.

Newsam, G. N. and J. D. Ramsdell (1982). Estimation of sparse Jacobian matrices, Harvard University
Division of Applied Sciences Report TR-17-81, Cambridge, MA.

Papadimitriou, C. H. and K. Steiglitz (1982). Combinatorial Optimization: Algorithms and Complexity,
Prentice Hall, Englewood Cliffs, NJ.

Peters, G. and J. H. Wilkinson (1970). The least squares problem and pseudo-inverses, Computer J., 13, pp.
309-316.

Powell, M. J. D. and P. L. Toint (1979). On the estimation of sparse Hessian matrices, SIAM Journal on
Numerical Analysis, 16, pp. 1060-1074.

Ryser, H. J. (1963). Combinatorial Mathematics, MAA Carus Mathematical Monograph Number 14,
Providence,  RI. 

Saks,  M.  and  J.  Kahn  (1983).  Personal  communication. 

Saunders,  M.  A.  (1977).  MINOS  system  manual,  Stanford  Systems  Optimization  Laboratory  Report  SOL 
77-31,  Stanford,  CA. 

Stockmeyer, L. J. (1982). Personal communication.

Thapa,  M.  N.  (1980).  Optimization  of  unconstrained  functions  with  sparse  Hessian  matrices,  Ph.  D.  Thesis, 
Stanford  University,  Stanford,  CA. 

Welsh,  D.  (1976).  Matroid  Theory,  Academic  Press,  London  and  New  York. 

Wigderson, A. (1982). A new approximate graph coloring algorithm, Proceedings of the Fourteenth Annual
ACM Symposium on Theory of Computing.


REPORT DOCUMENTATION PAGE

1. REPORT NUMBER: SOL 83-5

4. TITLE (and Subtitle): A COMBINATORIAL APPROACH TO SOME SPARSE MATRIX PROBLEMS

5. TYPE OF REPORT & PERIOD COVERED: Technical Report

7. AUTHOR(s): S. Thomas McCormick

8. CONTRACT OR GRANT NUMBER(s): N00014-75-C-0267, DAAG29-81-K-0156

9. PERFORMING ORGANIZATION NAME AND ADDRESS: Department of Operations Research - SOL, Stanford University, Stanford, CA 94305

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: NR-047-143

11. CONTROLLING OFFICE NAME AND ADDRESS: Office of Naval Research - Dept. of the Navy, 800 N. Quincy Street, Arlington, VA 22217

12. REPORT DATE: June 1983

13. NUMBER OF PAGES: 69

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office): U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709

15. SECURITY CLASS. (of this report): UNCLASSIFIED

16. DISTRIBUTION STATEMENT (of this Report): This document has been approved for public release and sale; its distribution is unlimited.

19. KEY WORDS: sparse Hessian approximation; NP-complete; graph coloring heuristics; direct methods; sparse matrices; elimination methods; bipartite matching; linear constraints; computational complexity

20. ABSTRACT: SEE ATTACHED


SOL  83-5,  "A  COMBINATORIAL  APPROACH  TO  SOME  SPARSE  MATRIX  PROBLEMS," 
by  S.  Thomas  McCormick 

This dissertation considers two combinatorial problems arising in large-scale,
sparse optimization. The first is the problem of approximating the
Hessian matrix of a smooth, non-linear function by finite differencing,
where the object is to minimize the required number of gradient evaluations.
The second is to find as sparse a representation as possible of a
given  set  of  linear  constraints. 

For  the  first  problem,  it  has  recently  been  realized  that  when  the  Hessian 
has  a  fixed,  known  sparsity  pattern,  a  considerable  reduction  in  gradient 
evaluations can often be achieved by a suitable choice of difference directions.
This dissertation advances a way of classifying the various methods
that have been proposed for choosing difference directions, and shows that
finding an optimally small set of directions for any of the four sub-varieties
of the Direct Methods is NP-Complete. The complexity results are
obtained  by  showing  that  finding  optimal  sets  of  difference  directions  is 
equivalent  to  related  graph  coloring  problems.  Some  results  for  more 
general  methods  are  reported  that  yield  good  lower  bounds  on  the  minimum 
number  of  gradient  evaluations  needed  to  approximate  many  Hessians. 

The  second  problem  has  been  shown  to  be  NP-Complete.  By  adopting  a  fairly 
mild  non-degeneracy  assumption  we  are  able  to  derive  a  low-order  polynomial 
algorithm  which  reduces  given  constraints  into  an  optimally  sparse 
equivalent set of constraints. This algorithm is based on bipartite matching
theory, and it induces a partial order on the rows of the matrix which
is  related  to  Dulmage-Mendelsohn  decomposition.  The  proof  that  the 
algorithm  is  correct  yields  a  performance  guarantee  when  the  algorithm  is 
applied  to  real  data,  and  several  modifications  that  improve  its  running 
time  are  discussed.  Some  computational  experience  is  presented  which 
indicates  that  the  algorithm  may  be  practically  useful  as  a  preprocessor 
for  linearly  constrained  optimization.  We  also  discuss  the  relationship  of 
this  research  to  finding  optimally  sparse  null-space  bases,  and  to  the 
complexity  of  matroid  oracles. 
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